Introduction To Neural Networks

These slides are largely borrowed from George Papadourakis,
Prévotet Jean-Christophe
Introduction

What are Neural Networks?



Neural networks are a new method of programming computers.
They are exceptionally good at performing pattern recognition
and other tasks that are very difficult to program using
conventional techniques.
Programs that employ neural nets are also capable of learning
on their own and adapting to changing conditions.
Background




The development of neural networks dates back to the early 1940s. The field
experienced an upsurge in popularity in the late 1980s, as a result of newly
discovered techniques and general advances in computer hardware
technology.
Some NNs are models of biological neural networks and some are not, but
historically, much of the inspiration for the field of NNs came from the desire to
produce artificial systems capable of sophisticated, perhaps intelligent, computations
similar to those that the human brain routinely performs, and thereby possibly to
enhance our understanding of the human brain.
Most NNs have some sort of training rule. In other words, NNs learn from examples
(as children learn to recognize dogs from examples of dogs) and exhibit some
capability for generalization beyond the training data.
Neural computing must not be considered as a competitor to conventional computing.
Rather, it should be seen as complementary as the most successful neural solutions
have been those which operate in conjunction with existing, traditional techniques.
Background (Cont.)



An Artificial Neural Network (ANN) is an information processing
paradigm inspired by the way biological nervous systems, such as
the human brain, process information.
The key element of this paradigm is the novel structure of the
information processing system. It is composed of a large number of
highly interconnected processing elements (neurons) working in
unison to solve specific problems. NNs, like people, learn by
example.
An NN is configured for a specific application, such as pattern
recognition or data classification, through a learning process.
Learning in biological systems involves adjustments to the synaptic
connections that exist between the neurons. This is true of NNs as
well.
Biological Neuron



In the human brain, a typical neuron collects signals from others
through a host of fine structures called dendrites.
The neuron sends out spikes of electrical activity through a long,
thin strand known as an axon, which splits into thousands of
branches.
At the end of each branch, a structure called a synapse converts the
activity from the axon into electrical effects that inhibit or excite
activity in the connected neurons.
How the Human Brain learns


The brain is a collection of about 10 billion interconnected neurons. Each
neuron is a cell that uses biochemical reactions to receive, process and
transmit information.
A neuron's dendritic tree is connected to a thousand neighbouring neurons.
When one of those neurons fires, a positive or negative charge is received by
one of the dendrites. The strengths of all the received charges are added
together through the processes of spatial and temporal summation.
Biological inspirations

Some numbers…
• The human brain contains about 10 billion nerve cells (neurons)
• Each neuron is connected to the others through about 10,000 synapses

Properties of the brain
• It can learn and reorganize itself from experience
• It adapts to the environment
• It is robust and fault tolerant
Biological neuron
[Figure: a biological neuron, labelled with its synapse, axon, nucleus, cell body and dendrites]

A neuron has
• A branching input (dendrites)
• A branching output (the axon)

The information circulates from the dendrites to the axon via the cell body

The axon connects to dendrites via synapses
• Synapses vary in strength
• Synapses may be excitatory or inhibitory
A Neuron Model

When a neuron receives excitatory input that is sufficiently large
compared with its inhibitory input, it sends a spike of electrical activity
down its axon. Learning occurs by changing the effectiveness of the
synapses so that the influence of one neuron on another changes.

We construct these neural networks by first trying to deduce the essential
features of neurons and their interconnections.
We then typically program a computer to simulate these features.

A Simple Neuron


An artificial neuron is a device with many inputs and one
output.
The neuron has two modes of operation; the training
mode and the using mode.
A Simple Neuron (Cont.)



In the training mode, the neuron can be trained to fire (or not), for
particular input patterns.
In the using mode, when a taught input pattern is detected at the
input, its associated output becomes the current output. If the input
pattern does not belong in the taught list of input patterns, the firing
rule is used to determine whether to fire or not.
The firing rule is an important concept in neural networks and
accounts for their high flexibility. A firing rule determines how one
calculates whether a neuron should fire for any input pattern. It
relates to all the input patterns, not only the ones on which the node
was previously trained.
Neural Network Techniques


Computers have to be explicitly programmed

Analyze the problem to be solved.

Write the code in a programming language.
Neural networks learn from examples




No requirement of an explicit description of the problem.
No need for a programmer.
The neural computer adapts itself during a training period, based on
examples of similar problems even without a desired solution to each
problem. After sufficient training the neural computer is able to relate the
problem data to the solutions, inputs to outputs, and it is then able to
offer a viable solution to a brand new problem.
Able to generalize or to handle incomplete data.
NNs vs Computers
Digital Computers

Deductive Reasoning. We apply
known rules to input data to produce
output.

Computation is centralized,
synchronous, and serial.

Memory is packeted, literally stored,
and location addressable.

Not fault tolerant. One transistor goes
and it no longer works.

Exact.

Static connectivity.

Applicable if well defined rules with
precise input data.
Neural Networks

Inductive Reasoning. Given input and
output data (training examples), we
construct the rules.

Computation is collective,
asynchronous, and parallel.

Memory is distributed, internalized,
short term and content addressable.

Fault tolerant, redundancy, and
sharing of responsibilities.

Inexact.

Dynamic connectivity.

Applicable if rules are unknown or
complicated, or if data are noisy or
partial.
Applications


Classification
• In marketing: consumer spending pattern classification
• In defence: radar and sonar image classification
• In agriculture & fishing: fruit and catch grading
• In medicine: ultrasound and electrocardiogram image classification, EEGs, medical diagnosis

Recognition and identification
• In general computing and telecommunications: speech, vision and handwriting recognition
• In finance: signature verification and bank note verification
Applications (Cont.)


Assessment
• In engineering: product inspection monitoring and control
• In defence: target tracking
• In security: motion detection, surveillance image analysis and fingerprint matching

Forecasting and prediction
• In finance: foreign exchange rate and stock market forecasting
• In agriculture: crop yield forecasting
• In marketing: sales forecasting
• In meteorology: weather prediction
What can you do with an NN and
what not?



In principle, NNs can compute any computable function, i.e., they
can do everything a normal digital computer can do. Almost any
mapping between vector spaces can be approximated to arbitrary
precision by feedforward NNs.
In practice, NNs are especially useful for classification and function
approximation problems usually when rules such as those that
might be used in an expert system cannot easily be applied.
NNs are, at least today, difficult to apply successfully to problems
that concern manipulation of symbols and memory. And there are no
methods for training NNs that can magically create information that
is not contained in the training data.
Who is concerned with NNs?








Computer scientists want to find out about the properties of non-symbolic
information processing with neural nets and about learning systems in
general.
Statisticians use neural nets as flexible, nonlinear regression and
classification models.
Engineers of many kinds exploit the capabilities of neural networks in many
areas, such as signal processing and automatic control.
Cognitive scientists view neural networks as a possible apparatus to
describe models of thinking and consciousness (High-level brain function).
Neuro-physiologists use neural networks to describe and explore medium-level
brain function (e.g. memory, sensory system, motorics).
Physicists use neural networks to model phenomena in statistical
mechanics and for a lot of other tasks.
Biologists use Neural Networks to interpret nucleotide sequences.
Philosophers and some other people may also be interested in Neural
Networks for various reasons
Pattern Recognition





An important application of neural networks is pattern recognition.
Pattern recognition can be implemented by using a feed-forward
neural network that has been trained accordingly.
During training, the network is trained to associate outputs with input
patterns.
When the network is used, it identifies the input pattern and tries to
output the associated output pattern.
The power of neural networks comes to life when a pattern that has
no output associated with it is given as an input.
In this case, the network gives the output that corresponds to a
taught input pattern that is least different from the given pattern.
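
The "least different" behaviour described above can be imitated with a simple Hamming-distance lookup. This is only a sketch of the observable behaviour, not of how a trained feed-forward network computes internally; the 3-bit patterns and the names `hamming` and `recall` are hypothetical and are not the T/H images from the slides.

```python
def hamming(a, b):
    """Number of positions where two equal-length bit patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def recall(taught, pattern):
    """Return the output associated with the taught input closest to `pattern`."""
    best_input = min(taught, key=lambda known: hamming(known, pattern))
    return taught[best_input]


# Hypothetical taught patterns and their associated outputs.
taught = {
    (1, 1, 1): "black",
    (0, 0, 0): "white",
}

print(recall(taught, (1, 1, 1)))  # a taught pattern
print(recall(taught, (1, 0, 1)))  # unseen pattern, one bit away from (1, 1, 1)
```

A pattern that is equally far from two taught patterns has no well-defined answer in this scheme; `min` simply returns the first candidate it meets.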
Pattern Recognition (cont.)

Suppose a network is trained to recognize
the patterns T and H. The associated
patterns are all black and all white
respectively as shown above.
Pattern Recognition (cont.)
Since the input pattern looks more like a ‘T’, when the
network classifies it, it sees the input closely resembling
‘T’ and outputs the pattern that represents a ‘T’.
Pattern Recognition (cont.)
The input pattern here closely resembles
‘H’ with a slight difference. The network in
this case classifies it as an ‘H’ and outputs
the pattern representing an ‘H’.
Pattern Recognition (cont.)




Here the top row is 2 errors away from a ‘T’ and 3 errors away from
an ‘H’, so the top output is black.
The middle row is 1 error away from both T and H, so the output is
random.
The bottom row is 1 error away from T and 2 away from H.
Therefore the output is black.
Since the input resembles a ‘T’ more than an ‘H’ the output of the
network is in favor of a ‘T’.
Perceptron Learning Algorithm
• First neural network learning model, from the late 1950s and 1960s
• Simple and limited (single-layer models)
• Basic concepts are similar for multi-layer models, so this is a good learning tool
• Still used in many current applications (modems, etc.)

Perceptron Node – Threshold
Logic Unit
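[Figure: a perceptron node with inputs x1, x2, …, xn, weights w1, w2, …, wn, threshold θ and output z]

$$z = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} x_i w_i \ge \theta \\ 0 & \text{if } \sum_{i=1}^{n} x_i w_i < \theta \end{cases}$$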
Perceptron Node – Threshold
Logic Unit
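[Figure: a perceptron node with inputs x1, x2, …, xn, weights w1, w2, …, wn, threshold θ and output z]

$$z = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} x_i w_i \ge \theta \\ 0 & \text{if } \sum_{i=1}^{n} x_i w_i < \theta \end{cases}$$

• Learn weights such that an objective function is maximized.
• What objective function should we use?
• What learning algorithm should we use?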
Perceptron Learning Algorithm
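[Figure: a perceptron node with inputs x1 and x2, weights w1 = .4 and w2 = -.2, threshold θ = .1 and output z]

Training set
x1   x2   t
.8   .3   1
.4   .1   0

$$z = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} x_i w_i \ge \theta \\ 0 & \text{if } \sum_{i=1}^{n} x_i w_i < \theta \end{cases}$$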
First Training Instance
[Figure: the same node with the first training instance applied: x1 = .8, x2 = .3, target t = 1]

net = .8*.4 + .3*-.2 = .26
Since net ≥ θ = .1, the output is z = 1.
Second Training Instance
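[Figure: the same node with the second training instance applied: x1 = .4, x2 = .1, target t = 0]

net = .4*.4 + .1*-.2 = .14
Since net ≥ θ = .1, the output is z = 1, which differs from the target t = 0, so the weights change by
$\Delta w_i = c(t - z)x_i$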
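
The two training instances above can be checked with a few lines of Python. This is a minimal sketch: the weights, threshold and training set come from the slides, while the learning rate `c = 0.1` and the function names are assumptions made for illustration, since the slides do not give a value for c here.

```python
def threshold_unit(x, w, theta):
    """Return (net, z) for a threshold logic unit: z = 1 if net >= theta."""
    net = sum(xi * wi for xi, wi in zip(x, w))
    return net, 1 if net >= theta else 0


def perceptron_delta(x, t, z, c):
    """Weight change from the perceptron rule: delta_w_i = c * (t - z) * x_i."""
    return [c * (t - z) * xi for xi in x]


w = [0.4, -0.2]   # weights from the slides
theta = 0.1       # threshold from the slides
c = 0.1           # assumed learning rate (not given in the slides)

for x, t in [([0.8, 0.3], 1), ([0.4, 0.1], 0)]:
    net, z = threshold_unit(x, w, theta)
    dw = perceptron_delta(x, t, z, c)
    print(f"x={x} t={t} net={net:.2f} z={z} delta_w={dw}")
```

The first instance is already classified correctly, so its weight change is zero; only the second instance produces a non-zero update.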
Perceptron Rule Learning
$\Delta w_i = c(t - z)x_i$

where $w_i$ is the weight from input i to the perceptron node, c is the
learning rate, t is the target for the current instance, z is the current
output, and $x_i$ is the ith input
• Least perturbation principle
  • Only change weights if there is an error
  • Use a small c rather than changing the weights enough to make the current pattern correct
  • Scale by $x_i$
• Create a perceptron node with n inputs
• Iteratively apply a pattern from the training set and apply the perceptron rule
• Each iteration through the training set is an epoch
• Continue training until total training set error ceases to improve
• Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists
Augmented Pattern Vectors
1 0 1 -> 0
1 0 0 -> 1
Augmented Version
1 0 1 1 -> 0
1 0 0 1 -> 1
• Treat the threshold like any other weight. No special case. Call it a bias since it biases the output up or down.
• Since we start with random weights anyway, we can ignore the −θ notion and just think of the bias as an extra available weight. (Note that the author uses a −1 input.)
• Always use a bias weight
Perceptron Rule Example

• Assume a 3 input perceptron plus bias (it outputs 1 if net > 0, else 0)
• Assume a learning rate c of 1 and initial weights all 0: $\Delta w_i = c(t - z)x_i$
• Training set
  0 0 1 -> 0
  1 1 1 -> 1
  1 0 1 -> 1
  0 1 1 -> 0

Pattern    Target   Weight Vector   Net   Output   ΔW
0 0 1 1    0        0 0 0 0         0     0        0  0  0  0
1 1 1 1    1        0 0 0 0         0     0        1  1  1  1
1 0 1 1    1        1 1 1 1         3     1        0  0  0  0
0 1 1 1    0        1 1 1 1         3     1        0 -1 -1 -1
0 0 1 1    0        1 0 0 0         0     0        0  0  0  0
1 1 1 1    1        1 0 0 0         1     1        0  0  0  0
1 0 1 1    1        1 0 0 0         1     1        0  0  0  0
0 1 1 1    0        1 0 0 0         0     0        0  0  0  0
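
The trace in this example can be reproduced with a short training loop. This is a sketch under the example's own assumptions (c = 1, zero initial weights, output 1 if net > 0); the function name `train_perceptron` and the trace layout are illustrative choices, not part of the slides.

```python
def train_perceptron(patterns, targets, c=1, epochs=2):
    """Apply the perceptron rule pattern by pattern; return weights and a trace."""
    w = [0] * len(patterns[0])
    trace = []
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            net = sum(wi * xi for wi, xi in zip(w, x))
            z = 1 if net > 0 else 0
            dw = [c * (t - z) * xi for xi in x]
            # Record the weights as they were before this update.
            trace.append((x, t, list(w), net, z, dw))
            w = [wi + dwi for wi, dwi in zip(w, dw)]
    return w, trace


# Training set from the example; the final 1 in each pattern is the bias input.
patterns = [(0, 0, 1, 1), (1, 1, 1, 1), (1, 0, 1, 1), (0, 1, 1, 1)]
targets = [0, 1, 1, 0]

weights, trace = train_perceptron(patterns, targets)
for row in trace:
    print(*row)
print("final weights:", weights)
```

The second epoch produces no weight changes, which is the stopping condition described earlier: the training set error has ceased to improve.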
If no bias then the hyperplane must go through the origin
Linear Separability
Linear Separability and Generalization
When is data noise vs. a legitimate exception
Limited Functionality of Hyperplane
How to Handle Multi-Class Output


• This is an issue with any learning model which only supports binary classification (perceptron, SVM, etc.)
• Create 1 perceptron for each output class, where the training set considers all other classes to be negative examples
  • Run all perceptrons on novel data and set the output to the class of the perceptron which outputs high
  • If there is a tie, choose the perceptron with the highest net value
• Create 1 perceptron for each pair of output classes, where the training set only contains examples from the 2 classes
  • Run all perceptrons on novel data and set the output to be the class with the most wins (votes) from the perceptrons
  • In case of a tie, use the net values to decide
  • Number of models grows by the square of the number of output classes
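
Both schemes can be sketched in a few lines once each perceptron's net value is available. The net values below are hypothetical, and the tie-break used in `one_vs_one` (summing the absolute net values of each class's wins) is only one possible reading of "use the net values to decide".

```python
from itertools import combinations


def one_vs_rest(nets):
    """nets maps class -> net value of that class's perceptron.

    Choosing the largest net also settles ties between perceptrons that fire.
    """
    return max(nets, key=nets.get)


def one_vs_one(pair_nets):
    """pair_nets maps (a, b) -> net of the a-vs-b perceptron (positive means a)."""
    votes, margin = {}, {}
    for (a, b), net in pair_nets.items():
        for cls in (a, b):
            votes.setdefault(cls, 0)
            margin.setdefault(cls, 0.0)
        winner = a if net > 0 else b
        votes[winner] += 1
        margin[winner] += abs(net)  # assumed tie-break: total strength of wins
    return max(votes, key=lambda cls: (votes[cls], margin[cls]))


classes = ["A", "B", "C"]
print(len(list(combinations(classes, 2))))  # number of pairwise perceptrons
print(one_vs_rest({"A": 0.3, "B": 1.2, "C": -0.5}))
print(one_vs_one({("A", "B"): 0.7, ("A", "C"): -0.2, ("B", "C"): 0.4}))
```

For k classes the pairwise scheme needs k(k − 1)/2 perceptrons, which is the quadratic growth noted above.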
Objective Functions: Accuracy/Error


• How do we judge the quality of a particular model (e.g. a perceptron with a particular setting of weights)?
• Consider how accurate the model is on the data set
  • Classification accuracy = # Correct / Total instances
  • Classification error = # Misclassified / Total instances (= 1 − acc)
• Usually minimize a loss function (aka cost, error)
• For real valued outputs and/or targets
  • Pattern error = Target − output
  • Signed errors could cancel each other, so sum their absolute values: $\sum_i |t_i - z_i|$ (L1 loss)
  • Common approach is squared error = $\sum_i (t_i - z_i)^2$ (L2 loss)
  • Total sum squared error = $\sum$ pattern errors = $\sum_{\text{patterns}} \sum_i (t_i - z_i)^2$
• For nominal data, pattern error is typically 1 for a mismatch and 0 for a match
  • For nominal (including binary) outputs and targets, SSE and classification error are equivalent
Mean Squared Error


• Mean Squared Error (MSE) = SSE/n, where n is the number of instances in the data set
  • This can be nice because it normalizes the error for data sets of different sizes
  • MSE is the average squared error per pattern
• Root Mean Squared Error (RMSE) is the square root of the MSE
  • Since the error was squared in the SSE, this puts the error value back into the same units as the targets and can thus be more intuitive
  • RMSE is a typical distance (error) of the outputs from the targets, on the same scale as the targets
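
The error measures defined above are straightforward to compute. The following sketch uses a small hypothetical set of binary targets and outputs; the function names are illustrative.

```python
import math


def accuracy(targets, outputs):
    """Fraction of instances whose output matches the target."""
    return sum(t == z for t, z in zip(targets, outputs)) / len(targets)


def sse(targets, outputs):
    """Sum squared error over all instances."""
    return sum((t - z) ** 2 for t, z in zip(targets, outputs))


def mse(targets, outputs):
    """Mean squared error: SSE divided by the number of instances."""
    return sse(targets, outputs) / len(targets)


def rmse(targets, outputs):
    """Root mean squared error, in the same units as the targets."""
    return math.sqrt(mse(targets, outputs))


targets = [1, 0, 1, 1]   # hypothetical data
outputs = [1, 1, 1, 0]

print("accuracy:", accuracy(targets, outputs))
print("classification error:", 1 - accuracy(targets, outputs))
print("SSE:", sse(targets, outputs))
print("MSE:", mse(targets, outputs))
print("RMSE:", rmse(targets, outputs))
```

With 0/1 targets and outputs each mismatch contributes exactly 1 to the SSE, so the MSE equals the classification error, which is the equivalence mentioned for nominal data.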
Perceptron Learning Theorem

A perceptron (threshold unit) can learn
anything that it can represent (i.e.
anything separable with a hyperplane)
The Exclusive OR problem
A perceptron cannot represent Exclusive
OR since it is not linearly separable.
Minsky & Papert (1969) offered a solution to the XOR problem by
combining perceptron unit responses using a second layer of
units: piecewise linear classification using an MLP with
threshold (perceptron) units.
[Figure: a network of threshold units numbered 1 to 3, with connection weights of +1, that solves XOR]
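
A two-layer arrangement of threshold units that computes XOR can be written out directly. The weights and thresholds below are one hand-picked choice for illustration and are not read off the figure in the slides.

```python
def step(net, theta):
    """Threshold unit: fire (1) if net reaches the threshold, else 0."""
    return 1 if net >= theta else 0


def xor_net(x1, x2):
    """XOR built from three threshold units (illustrative weights)."""
    h_or = step(x1 + x2, 0.5)    # hidden unit 1: fires if at least one input is 1
    h_and = step(x1 + x2, 1.5)   # hidden unit 2: fires only if both inputs are 1
    # Output unit: "OR but not AND", using weights +1 and -1.
    return step(h_or - h_and, 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```

Each hidden unit draws one line in the input plane, and the output unit combines the two half-planes into the band between them, which is the piecewise linear region XOR needs.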
Linear Models which are Non-Linear in the Input Space
• So far we have used

$$f(x, w) = \operatorname{sign}\left(\sum_{i=1}^{n} w_i x_i\right)$$

• We could preprocess the inputs in a non-linear way and do

$$f(x, w) = \operatorname{sign}\left(\sum_{i=1}^{m} w_i \phi_i(x)\right)$$

• To the perceptron algorithm it looks just the same and it can use the same learning algorithm; it just has different inputs (as in an SVM)
• For example, for a problem with two inputs x and y (plus the bias), we could also add the inputs $x^2$, $y^2$, and $x \cdot y$
• The perceptron would just think it is a 5 dimensional task, and it is linear in those 5 dimensions
  • But what kind of decision surfaces would it allow for the original 2-d input space?
Quadric Machine





• All quadratic surfaces (2nd order)
  • ellipsoid
  • parabola
  • etc.
• That significantly increases the number of problems that can be solved
• But there are still many problems which are not quadrically separable
• Could go to 3rd and higher order features, but the number of possible features grows exponentially
• Multi-layer neural networks will allow us to discover high-order features automatically from the input space
Simple Quadric Example
[Figure: data points on the f1 axis, which runs from -3 to 2; a second plot shows the same points with f2 = f1² on the vertical axis]

• A perceptron with just feature f1 cannot separate the data
• Could we add another, transformed feature to our perceptron? f2 = f1²
• Note that we could also think of this as just using feature f1 but now allowing a quadric surface to separate the data
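
The effect of the added feature can be checked numerically. The class labels below are an assumption made for illustration (outer points in one class, inner points in the other), since the slides only show them in a figure; `separable_by_threshold` is a hypothetical helper that tests whether a single cut on one feature splits the classes.

```python
def separable_by_threshold(values, labels):
    """True if one threshold on `values` separates the two classes."""
    ordered = [lab for _, lab in sorted(zip(values, labels))]
    changes = sum(a != b for a, b in zip(ordered, ordered[1:]))
    return changes <= 1


f1 = [-3, -2, -1, 0, 1, 2]
labels = [1, 1, 0, 0, 0, 1]        # assumed: class 1 outside, class 0 inside
f2 = [v ** 2 for v in f1]          # transformed feature f2 = f1^2

print("separable on f1 alone:", separable_by_threshold(f1, labels))
print("separable on f2:", separable_by_threshold(f2, labels))

# A perceptron on (f1, f2) with weights (0, 1) and threshold 2.5 does the job.
predicted = [1 if 0 * a + 1 * b >= 2.5 else 0 for a, b in zip(f1, f2)]
print(predicted == labels)
```

In the original f1 space the resulting decision rule is f1² ≥ 2.5, a pair of cut points rather than a single one, which is the one-dimensional version of a quadric surface.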
The Key Elements of Neural Networks

Neural computing requires a number of neurons to be connected together
into a neural network. Neurons are arranged in layers.

[Figure: a neuron with inputs p1, p2, p3, weights w1, w2, w3, a bias b on a constant input of 1, transfer function f and output a]

$$a = f(p_1 w_1 + p_2 w_2 + p_3 w_3 + b) = f\left(\sum_i p_i w_i + b\right)$$

Each neuron within the network is usually a simple processing unit which
takes one or more inputs and produces an output. At each neuron, every
input has an associated weight which modifies the strength of each input.
The neuron simply adds together all the weighted inputs and the bias, and
passes the sum through f to calculate an output to be passed on.
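
As a small illustration of the formula for a, the following sketch evaluates one neuron. The sigmoid transfer function and the numeric inputs, weights and bias are assumptions chosen for the example; the slides leave f and the values unspecified.

```python
import math


def sigmoid(net):
    """A common choice for the transfer function f (assumed here)."""
    return 1.0 / (1.0 + math.exp(-net))


def neuron(p, w, b, f=sigmoid):
    """Compute a = f(sum_i p_i * w_i + b) for one neuron."""
    net = sum(pi * wi for pi, wi in zip(p, w)) + b
    return f(net)


p = [1.0, 0.5, -1.0]   # hypothetical inputs p1, p2, p3
w = [0.2, -0.4, 0.1]   # hypothetical weights w1, w2, w3
b = 0.1                # hypothetical bias

print(neuron(p, w, b))
```

Passing a different `f`, for example `lambda net: 1 if net >= 0 else 0`, turns the same unit into the threshold perceptron used earlier.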
Three-layer networks
[Figure: a network with inputs x1, x2, …, xn, hidden layers, and an output layer]

Properties of architecture
• No connections within a layer
• No direct connections between input and output layers
• Fully connected between layers
• Often more than 3 layers
• Number of output units need not equal number of input units
• Number of hidden units per layer can be more or less than input or output units

Each unit is a perceptron

$$y_i = f\left(\sum_{j=1}^{m} w_{ij} x_j + b_i\right)$$

Often include bias as an extra weight
What do each of the layers do?
• 1st layer draws linear boundaries
• 2nd layer combines the boundaries
• 3rd layer can generate arbitrarily complex boundaries
Feed Forward Neural Networks

[Figure: a feed-forward network with inputs x1, x2, …, xn, a 1st hidden layer, a 2nd hidden layer, and an output layer]

• The information is propagated from the inputs to the outputs
• Computation of $N_o$ non-linear functions of n input variables by composition of N algebraic functions
• Time has no role (NO cycle between outputs and inputs)
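
The forward propagation described above amounts to applying the per-unit formula $y_i = f(\sum_j w_{ij} x_j + b_i)$ layer after layer. The sketch below assumes a tanh transfer function and a small hypothetical 2-3-1 network; none of the numbers come from the slides.

```python
import math


def layer(inputs, weights, biases, f):
    """Compute y_i = f(sum_j w_ij * x_j + b_i) for every unit i in a layer."""
    return [
        f(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, layers, f=math.tanh):
    """Propagate x through the layers in order; there are no cycles."""
    for weights, biases in layers:
        x = layer(x, weights, biases, f)
    return x


# Hypothetical network: 2 inputs -> 3 hidden units -> 1 output unit.
network = [
    ([[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]

print(forward([1.0, 2.0], network))
```

Each layer only reads the outputs of the layer before it, which is what makes the computation a plain composition of functions.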
Learning

The procedure of estimating the parameters of the neurons
so that the whole network can perform a specific task

2 types of learning
• Supervised learning
• Unsupervised learning
The Learning process (supervised)



Present the network a number of inputs and their corresponding outputs
See how closely the actual outputs match the desired ones
Modify the parameters to better approximate the desired outputs
Supervised learning
• The desired response of the neural network as a function of particular inputs is well known.
• A “teacher” may provide examples and teach the neural network how to fulfil a certain task

Unsupervised learning

• Idea: group typical input data according to resemblance criteria that are unknown a priori
• Data clustering
• No need for a teacher
  • The network finds the correlations between the data by itself
Properties of Neural Networks



• Supervised (non-recurrent) networks are universal approximators
• Theorem: any bounded function can be approximated to arbitrary precision by a neural network with a finite number of hidden neurons
Type of Approximators


Linear approximators : for a given precision, the number of
parameters grows exponentially with the number of variables
(polynomials)
Non-linear approximators (NN), the number of parameters grows
linearly with the number of variables
Other properties

• Adaptivity
  • Adapts its weights to the environment and is easily retrained
• Generalization ability
  • May compensate for a lack of data
• Fault tolerance
  • Graceful degradation of performance if damaged, since the information is distributed within the entire net
Example
Classification (Discrimination)
• Classify objects into defined categories
• Rough decision OR
• Estimation of the probability for a certain object to belong to a specific class

Example: Data mining
• Applications: economy, speech and pattern recognition, sociology, etc.

Example
Examples of handwritten postal codes
drawn from a database available from the US Postal service
What do we need to use NN ?






Determination of pertinent inputs
Collection of data for the learning and testing
phase of the neural network
Finding the optimum number of hidden nodes
Estimate the parameters (Learning)
Evaluate the performances of the network
IF performance is not satisfactory then review
all the preceding points
Why we need backpropagation


• Networks without hidden units are very limited in the input-output mappings they can model.
  • More layers of linear units do not help. It's still linear.
  • Fixed output non-linearities are not enough.
• We need multiple layers of adaptive non-linear hidden units. This gives us a universal approximator. But how can we train such nets?
  • We need an efficient way of adapting all the weights, not just the last layer. This is hard. Learning the weights going into hidden units is equivalent to learning features.
  • Nobody is telling us directly what hidden units should do.
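
The claim that extra layers of linear units do not help can be checked numerically: two linear layers applied in sequence give the same result as one linear layer whose weight matrix is the product of the two. The matrices and input below are hypothetical, and biases are omitted to keep the sketch short.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply matrix a by matrix b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


w1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 linear hidden units
w2 = [[1.0, -2.0, 0.5]]                      # 3 hidden units -> 1 linear output
x = [2.0, -1.0]

two_layers = matvec(w2, matvec(w1, x))
one_layer = matvec(matmul(w2, w1), x)
print(two_layers, one_layer)
```

Because the collapsed matrix `matmul(w2, w1)` always exists, a stack of linear layers can never represent more than a single linear layer can, which is why the hidden units need a non-linearity.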
Learning by perturbing weights


• Randomly perturb one weight and see if it improves performance. If so, save the change.
  • Very inefficient. We need to do multiple forward passes on a representative set of training data just to change one weight.
  • Towards the end of learning, large weight perturbations will nearly always make things worse.
• We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.
  • Not any better because we need lots of trials to “see” the effect of changing one weight through the noise created by all the others.

[Figure: a network with input units, hidden units and output units] Learning the hidden to output weights is easy. Learning the input to hidden weights is hard.
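
A minimal sketch of the single-weight perturbation scheme makes its cost visible: every trial re-evaluates the whole data set, yet changes at most one weight. The linear unit, the three data points, the perturbation scale and the fixed random seed are all assumptions made for illustration.

```python
import random


def loss(w, data):
    """Sum squared error of a linear unit y = w . x over the data set."""
    return sum((t - sum(wi * xi for wi, xi in zip(w, x))) ** 2 for x, t in data)


def perturb_one_weight(w, data, rng, scale=0.1):
    """Try a random change to one weight; keep it only if the loss drops."""
    i = rng.randrange(len(w))
    trial = list(w)
    trial[i] += rng.gauss(0.0, scale)
    return trial if loss(trial, data) < loss(w, data) else w


rng = random.Random(0)  # fixed seed so the run is repeatable
data = [((1.0, 0.0), 1.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 0.0)]  # hypothetical
w = [0.0, 0.0]

history = [loss(w, data)]
for _ in range(200):
    w = perturb_one_weight(w, data, rng)
    history.append(loss(w, data))

print("initial loss:", history[0])
print("final loss:", history[-1])
```

The loss never increases because rejected trials are discarded, but each accepted step costs two full passes over the data, which is the inefficiency that motivates computing error derivatives instead.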
The idea behind back propagation

• We don’t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.
  • Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities.
  • Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined.
  • We can compute error derivatives for all the hidden units efficiently.
  • Once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit.
Some Success Stories

Back-propagation has been used for a
large number of practical applications.
• Recognizing hand-written characters
• Predicting the future price of stocks
• Detecting credit card fraud
• Recognizing speech
• Predicting the next word in a sentence from the previous words
Overview of the applications in this lecture



• Modeling relational data
  • This toy application shows that the hidden units can learn to represent sensible features that are not at all obvious.
  • It also bridges the gap between relational graphs and feature vectors.
• Learning to predict the next word in a sentence
  • The toy model above can be turned into a useful model for predicting words to help a speech recognizer.
• Reading documents
  • An impressive application that is used to read checks.
Multi-Layer Perceptron

[Figure: a multi-layer perceptron with input data, a 1st hidden layer, a 2nd hidden layer, and an output layer]

• One or more hidden layers
• Sigmoid activation functions
Learning

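Back-propagation algorithm (credit assignment)

$$\mathrm{net}_j = w_{j0} + \sum_{i=1}^{n} w_{ji} o_i, \qquad o_j = f_j(\mathrm{net}_j)$$

$$\delta_j = -\frac{\partial E}{\partial \mathrm{net}_j}$$

$$\Delta w_{ji} = -\alpha \frac{\partial E}{\partial w_{ji}} = -\alpha \frac{\partial E}{\partial \mathrm{net}_j} \frac{\partial \mathrm{net}_j}{\partial w_{ji}} = \alpha \delta_j o_i$$

$$\delta_j = -\frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial \mathrm{net}_j} = -\frac{\partial E}{\partial o_j} f'(\mathrm{net}_j)$$

If the jth node is an output unit:

$$E = \frac{1}{2} \sum_j (t_j - o_j)^2 \;\Rightarrow\; \frac{\partial E}{\partial o_j} = -(t_j - o_j)$$

$$\delta_j = (t_j - o_j) f'(\mathrm{net}_j)$$

If the jth node is a hidden unit feeding the units k:

$$\frac{\partial E}{\partial o_j} = \sum_k \frac{\partial E}{\partial \mathrm{net}_k} \frac{\partial \mathrm{net}_k}{\partial o_j} = -\sum_k \delta_k w_{kj}$$

$$\delta_j = f'_j(\mathrm{net}_j) \sum_k \delta_k w_{kj}$$

Momentum term to smooth the weight changes over time:

$$\Delta w_{ji}(t) = \alpha \delta_j(t) o_i(t) + \gamma \Delta w_{ji}(t-1)$$

$$w_{ji}(t) = w_{ji}(t-1) + \Delta w_{ji}(t)$$

Here α is the learning rate and γ is the momentum coefficient.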
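
The delta and momentum formulas can be put together into a small training loop. This is a sketch under several assumptions that do not come from the slides: one hidden layer of three sigmoid units, a single sigmoid output, per-pattern (online) updates, the XOR data as a toy problem, and the values chosen for `alpha`, `gamma`, the number of epochs and the random seed.

```python
import math
import random


def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))


def forward(x, w_hid, w_out):
    """Each weight row stores the bias w_j0 first, then the input weights."""
    hid = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in w_hid]
    out = sigmoid(w_out[0] + sum(wi * hi for wi, hi in zip(w_out[1:], hid)))
    return hid, out


def total_error(data, w_hid, w_out):
    """E = 1/2 * sum of (t - o)^2 over the data set."""
    return sum(0.5 * (t - forward(x, w_hid, w_out)[1]) ** 2 for x, t in data)


def train(data, w_hid, w_out, alpha=0.5, gamma=0.5, epochs=5000):
    """Online back-propagation with momentum; updates the weights in place."""
    dw_hid = [[0.0] * len(row) for row in w_hid]
    dw_out = [0.0] * len(w_out)
    for _ in range(epochs):
        for x, t in data:
            hid, out = forward(x, w_hid, w_out)
            # Output unit: delta = (t - o) * f'(net), with f'(net) = o * (1 - o).
            d_out = (t - out) * out * (1.0 - out)
            # Hidden units: delta_j = f'(net_j) * sum_k delta_k * w_kj
            # (computed before the output weights are changed).
            d_hid = [h * (1.0 - h) * d_out * w_out[j + 1] for j, h in enumerate(hid)]
            for i, o_i in enumerate([1.0] + hid):
                dw_out[i] = alpha * d_out * o_i + gamma * dw_out[i]
                w_out[i] += dw_out[i]
            for j, row in enumerate(w_hid):
                for i, o_i in enumerate([1.0] + list(x)):
                    dw_hid[j][i] = alpha * d_hid[j] * o_i + gamma * dw_hid[j][i]
                    row[i] += dw_hid[j][i]


rng = random.Random(0)
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w_hid = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
w_out = [rng.uniform(-1, 1) for _ in range(4)]

error_before = total_error(data, w_hid, w_out)
train(data, w_hid, w_out)
error_after = total_error(data, w_hid, w_out)
print(error_before, error_after)
```

The hidden deltas are computed from the output weights as they were during the forward pass, which matches the derivation: all derivatives refer to the same network state.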
Different non-linearly separable problems

Structure      Types of Decision Regions
Single-Layer   Half plane bounded by hyperplane
Two-Layer      Convex open or closed regions
Three-Layer    Arbitrary (complexity limited by number of nodes)

[Figure: for each structure, example decision regions for classes A and B on the Exclusive-OR problem, on classes with meshed regions, and for the most general region shapes]

Source: Neural Networks – An Introduction, Dr. Andrew Hunter
Feedforward NNs

The basic structure of a feedforward neural network

The learning rule modifies the weights according to the input patterns that it is
presented with. In a sense, ANNs learn by example, as do their biological
counterparts.
When the desired outputs are known we have supervised learning, or learning with a
teacher.

An overview of backpropagation
1. A set of examples for training the network is assembled. Each case consists of a problem statement (which represents the input into the network) and the corresponding solution (which represents the desired output from the network).
2. The input data is entered into the network via the input layer.
3. Each neuron in the network processes the input data, with the resultant values steadily "percolating" through the network, layer by layer, until a result is generated by the output layer.
4. The actual output of the network is compared to the expected output for that particular input. This results in an error value. The connection weights in the network are gradually adjusted, working backwards from the output layer, through the hidden layer, and to the input layer, until the correct output is produced. Fine tuning the weights in this way has the effect of teaching the network how to produce the correct output for a particular input, i.e. the network learns.
The Learning Rule

The delta rule is often used by the most common class
of ANNs, called backpropagation neural networks.

[Figure: a network shown with its input and desired output]

When a neural network is initially presented with a
pattern it makes a random guess as to what it might be. It
then sees how far its answer was from the actual one
and makes an appropriate adjustment to its connection
weights.
The Insides of the
Delta Rule

Backpropagation performs gradient descent within the solution's
vector space, aiming towards a global minimum.
The error surface is often depicted as a smooth hyperparaboloid, but it is
seldom that smooth. Indeed, in most problems, the solution
space is quite irregular, with numerous pits and hills which may cause
the network to settle down in a local minimum which is not the best
overall solution.
Early stopping
•Training data
•Validation data
•Test data
Design Considerations





What transfer function should be used?
How many inputs does the network need?
How many hidden layers does the network
need?
How many hidden neurons per hidden layer?
How many outputs should the network have?
There is no standard methodology for determining these values. Even though there
are some heuristics, the final values are determined by a trial and error
procedure.
Characteristics of NNs






• Learning from experience: suited to complex, difficult-to-solve problems for which plenty of data describing the problem is available
• Generalizing from examples: can interpolate from previous learning and give the correct response to unseen data
• Rapid application development: NNs are generic machines and quite independent of domain knowledge
• Adaptability: adapts to a changing environment, if properly designed
• Computational efficiency: although the training of a neural network demands a lot of computer power, a trained network demands almost nothing in recall mode
• Non-linearity: not based on linear assumptions about the real world
Pre-processing

• Transform data to NN inputs
  • Applying a mathematical or statistical function
  • Encoding textual data from a database
• Selection of the most relevant data and outlier removal
• Minimizing network inputs
  • Feature extraction
  • Principal components analysis
  • Waveform / image analysis
• Coding pre-processed data to network inputs
Different types of Neural
Networks

Feed-forward networks
• Feed-forward NNs allow signals to travel one way only, from input to output. There is no feedback (loops), i.e. the output of any layer does not affect that same layer.
• Feed-forward NNs tend to be straightforward networks that associate inputs with outputs. They are extensively used in pattern recognition.
• This type of organization is also referred to as bottom-up or top-down.
Continued

Feedback networks




Feedback networks can have signals traveling in both directions
by introducing loops in the network.
Feedback networks are dynamic; their 'state' is changing
continuously until they reach an equilibrium point.
They remain at the equilibrium point until the input changes and
a new equilibrium needs to be found.
Feedback architectures are also referred to as interactive or
recurrent, although the latter term is often used to denote
feedback connections in single-layer organizations.
Diagram of an NN
Fig: A simple Neural Network
Network Layers



Input Layer - The activity of the input units represents the
raw information that is fed into the network.
Hidden Layer - The activity of each hidden unit is
determined by the activities of the input units and the
weights on the connections between the input and the
hidden units.
Output Layer - The behavior of the output units depends
on the activity of the hidden units and the weights
between the hidden and output units.
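The layer description above can be sketched as a forward pass through one hidden layer. This is a minimal illustration, assuming a sigmoid transfer function, 3 inputs, 4 hidden units, 2 outputs and random weights. None of these choices come from the slides, and the names `forward`, `W_hidden` and `W_out` are hypothetical.

```python
import numpy as np


def sigmoid(a):
    """Logistic transfer function (an assumed choice)."""
    return 1.0 / (1.0 + np.exp(-a))


def forward(x, W_hidden, b_hidden, W_out, b_out):
    """Propagate one input vector through a single-hidden-layer network."""
    # Hidden activity depends on the inputs and the input-to-hidden weights.
    hidden = sigmoid(W_hidden @ x + b_hidden)
    # Output activity depends on the hidden activity and hidden-to-output weights.
    output = sigmoid(W_out @ hidden + b_out)
    return hidden, output


rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])        # raw information fed into the network
W_hidden = rng.normal(size=(4, 3))    # weights between input and hidden units
b_hidden = np.zeros(4)
W_out = rng.normal(size=(2, 4))       # weights between hidden and output units
b_out = np.zeros(2)

hidden, output = forward(x, W_hidden, b_hidden, W_out, b_out)
print(hidden.shape, output.shape)
```

Changing `W_hidden` changes when each hidden unit is active, which is the sense in which the hidden units build their own representation of the input.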
Continued


This simple type of network is interesting because the
hidden units are free to construct their own
representations of the input.
The weights between the input and hidden units
determine when each hidden unit is active, and so by
modifying these weights, a hidden unit can choose what
it represents.
Network Structure


The number of layers and of neurons depends on the
specific task. In practice this issue is solved by trial and
error.
Two types of adaptive algorithms can be used:

 • Start from a large network and successively remove
neurons and links until network performance degrades.
 • Begin with a small network and introduce new neurons until
performance is satisfactory.
Network Parameters
 • How are the weights initialized?
 • How many hidden layers and how many neurons?
 • How many examples in the training set?

Preprocessing
Why preprocessing?

The curse of dimensionality
 • The quantity of training data needed grows
exponentially with the dimension of the input space
 • In practice, we only have a limited quantity of
input data

Increasing the dimensionality of the problem therefore leads
to a poor representation of the mapping
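One common way to illustrate this exponential growth is to divide each input axis into a fixed number of intervals and count the resulting cells, each of which would need at least one training example. The sketch below assumes 10 intervals per axis. That figure and the function name `cells_needed` are illustrative choices, not values from the slides.

```python
def cells_needed(bins_per_axis, dimension):
    """Number of grid cells when each of `dimension` axes is split into bins."""
    return bins_per_axis ** dimension


# With 10 intervals per input variable, the cell count grows as 10**d.
for d in (1, 2, 3, 10):
    print(d, cells_needed(10, d))
```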
Preprocessing methods

Normalization
 • Transform input values so that they can be
exploited by the neural network

Component reduction
 • Build new input variables in order to reduce
their number
 • No loss of information about their distribution
Character recognition example


Image of 256x256 pixels
8-bit pixel values (grey level)
$2^{256 \times 256 \times 8} \approx 10^{158000}$ different images

Necessary to extract features
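The order of magnitude quoted for the number of distinct images can be checked with a few lines of arithmetic. The sketch converts the number of bits per image into a base-10 exponent. The variable names are illustrative.

```python
import math

# A 256x256 image with 8 bits per pixel is described by this many bits.
bits = 256 * 256 * 8

# 2**bits == 10**exponent, with exponent = bits * log10(2).
exponent = bits * math.log10(2)
print(bits, round(exponent))
```

The exponent comes out at roughly 157826, which the slide rounds to 158000.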
Normalization
 • Inputs of the neural net are often of different types with different
orders of magnitude (e.g. pressure, temperature, etc.)
 • It is necessary to normalize the data so that they have the same
impact on the model
 • Center and scale (standardize) the variables

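$\bar{x}_i = \frac{1}{N} \sum_{n=1}^{N} x_i^n$   (Average on all points)
$\sigma_i^2 = \frac{1}{N-1} \sum_{n=1}^{N} \left( x_i^n - \bar{x}_i \right)^2$   (Variance calculation)
$\tilde{x}_i^n = \frac{x_i^n - \bar{x}_i}{\sigma_i}$   (Variables transposition)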
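Centering and scaling each input variable, as described above, can be sketched with NumPy. Here row n of `X` holds data point n and column i holds input variable i, and the variance uses the N - 1 denominator. The pressure and temperature values are made up for illustration, and `standardize` is a hypothetical helper name.

```python
import numpy as np


def standardize(X):
    """Center each column on its mean and divide by its standard deviation."""
    mean = X.mean(axis=0)            # average over all points, per variable
    std = X.std(axis=0, ddof=1)      # square root of the variance (N - 1 denominator)
    return (X - mean) / std, mean, std


# Two variables with very different orders of magnitude (illustrative values).
X = np.array([
    [1000.0, 20.0],
    [1010.0, 25.0],
    [990.0, 15.0],
    [1005.0, 30.0],
])
Z, mean, std = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0, ddof=1))
```

After the transformation every column has zero mean and unit variance, so no variable dominates because of its units. A variable that is constant over the data set has zero variance and would need separate handling. The same `mean` and `std` computed on the training data would normally be reused for any new data.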
Components reduction




Sometimes, the number of inputs is too large to
be exploited
Reducing the number of inputs simplifies the
construction of the model
Goal: Better representation of the data in order
to get a more compact view without losing
relevant information
Reduction methods (PCA, CCA, etc.)
Principal Components Analysis (PCA)

Principle





Linear projection method to reduce the number of parameters
Transform a set of correlated variables into a new set of
uncorrelated variables
Map the data into a space of lower dimensionality
Form of unsupervised learning
Properties


It can be viewed as a rotation of the existing axes to new
positions in the space defined by original variables
New axes are orthogonal and represent the directions with
maximum variability




Compute the d-dimensional mean $\mu$
Compute the d×d covariance matrix
Compute eigenvectors and eigenvalues
Choose the k largest eigenvalues
 • k is the inherent dimensionality of the subspace governing the
signal
Form a d×k matrix A whose k columns are the corresponding eigenvectors
The representation of the data consists of projecting the data into
a k-dimensional subspace by
$x' = A^{t} (x - \mu)$
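The PCA steps listed above can be sketched with NumPy as follows. The data is synthetic and the function name `pca_project` is hypothetical. Rows of `X` are data points, so the projection of all points at once is written `(X - mu) @ A`, which is the row-wise form of projecting each point onto the k retained eigenvectors.

```python
import numpy as np


def pca_project(X, k):
    """Project the rows of X onto the k leading principal components."""
    mu = X.mean(axis=0)                       # d-dimensional mean
    cov = np.cov(X - mu, rowvar=False)        # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest eigenvalues
    A = eigvecs[:, order]                     # d x k matrix of eigenvectors
    Z = (X - mu) @ A                          # coordinates in the k-dim subspace
    return Z, A, mu, eigvals[order]


# Synthetic correlated data in 3 dimensions (illustrative only).
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 0.1]])
X = rng.normal(size=(200, 3)) @ mixing

Z, A, mu, eigvals = pca_project(X, k=2)
print(Z.shape, eigvals)
```

The covariance matrix of the projected data `Z` is diagonal, which is the sense in which the new variables are uncorrelated.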
Example of data representation
using PCA
Limitations of PCA

Reducing the dimensions of complex
distributions may require nonlinear
processing
Curvilinear Components Analysis

Nonlinear extension of PCA
Can be seen as a self-organizing neural network
Preserves the proximity between points in
the input space, i.e. the local topology of the
distribution
Makes it possible to unfold some manifolds in the input
data
Keeps the local topology
Example of data representation
using CCA
Nonlinear projection of a spiral
Nonlinear projection of a horseshoe
Other methods

Neural pre-processing
 • Use a neural network to reduce the
dimensionality of the input space
 • Overcomes the limitation of PCA
 • Auto-associative mapping => form of
unsupervised training
Fig: Auto-associative network mapping a D-dimensional input space (x1, x2, …, xd) through an M-dimensional sub-space (z1, …, zM) to a D-dimensional output space (x1, x2, …, xd)

Transformation of a d-dimensional input space
into an M-dimensional output space
Nonlinear component analysis
The dimensionality of the sub-space must be
decided in advance
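An auto-associative mapping of this kind can be sketched as a small network trained to reproduce its own input through an M-dimensional bottleneck. Everything specific below is an assumption made for illustration: the toy data, the tanh hidden units, the linear output layer, plain gradient descent, the learning rate and the number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3-D points lying close to a 2-D plane (illustrative only).
t = rng.uniform(-1.0, 1.0, size=(200, 2))
X = np.column_stack([t[:, 0], t[:, 1], t[:, 0] + t[:, 1]])
X += 0.01 * rng.normal(size=X.shape)

D, M = X.shape[1], 2          # M, the sub-space dimension, is fixed in advance
W1 = 0.1 * rng.normal(size=(D, M))
b1 = np.zeros(M)
W2 = 0.1 * rng.normal(size=(M, D))
b2 = np.zeros(D)


def forward(X, W1, b1, W2, b2):
    """Encode into the M-dimensional sub-space, then decode back to D dims."""
    Z = np.tanh(X @ W1 + b1)      # z1 ... zM
    X_hat = Z @ W2 + b2           # reconstruction of x1 ... xd
    return Z, X_hat


learning_rate = 0.05
losses = []
for step in range(2000):
    Z, X_hat = forward(X, W1, b1, W2, b2)
    error = X_hat - X
    # Mean over points of the squared reconstruction error.
    losses.append(np.mean(np.sum(error ** 2, axis=1)))

    # Gradients of that loss, obtained by back-propagation.
    d_out = 2.0 * error / len(X)
    grad_W2 = Z.T @ d_out
    grad_b2 = d_out.sum(axis=0)
    d_hidden = (d_out @ W2.T) * (1.0 - Z ** 2)   # derivative of tanh
    grad_W1 = X.T @ d_hidden
    grad_b1 = d_hidden.sum(axis=0)

    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

print(losses[0], losses[-1])
```

No target labels are needed because the input itself serves as the target, which is why this counts as a form of unsupervised training. After training, `Z` is the reduced representation that would be passed on to the next stage.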
« Intelligent preprocessing »
 • Use a priori knowledge of the problem
to help the neural network perform its
task
 • Manually reduce the dimension of the
problem by extracting the relevant features
 • More or less complex algorithms to
process the input data

Example in the H1 L2 neural network trigger

Principle

Intelligent preprocessing



Extract physical values for the neural net (momentum, energy, particle
type)
Combination of information from different sub-detectors
Executed in 4 steps:
 • Clustering: find regions of interest within a given detector layer
 • Matching: combination of clusters belonging to the same object
 • Ordering: sorting of objects by parameter
 • Post Processing: generates variables for the neural network
Conclusion on the preprocessing




Preprocessing has a huge impact on the
performance of neural networks
The distinction between the preprocessing and
the neural net is not always clear
The goal of preprocessing is to reduce the
number of parameters to face the challenge of the
“curse of dimensionality”
There are many preprocessing algorithms and
methods
 • Preprocessing with prior knowledge
 • Preprocessing without