Introduction to Neural Networks
Marina Brozović
References: talks by Matyas Bartok, Imperial College;
Geoffrey Hinton, University of Toronto;
Diana Spears, University of Wyoming;
Nadia Bolshakova, Trinity College Dublin
Neural Network Characteristics
• Basis:
– a crude, low-level model of biological neural systems
• Powerful:
– capable of modeling very complex functions/relationships
• Ease of Use:
– learns the structure for you, i.e. avoids the need for formulating rules by hand
– the user must still choose the type of network, its complexity, the learning algorithm, and the inputs to use
History of Neural Networks
Attempts
3
to mimic the human brain date back to work in the
1930s, 1940s, & 1950s by Alan Turing, Warren McCullough,
Walter Pitts, Donald Hebb and James von Neumann
1957
Rosenblatt at Cornell developed Perceptron, a
hardware neural net for character recognition
1959
Widrow and Hoff at Stanford developed NN for
adaptive control of noise on telephone lines
3
History of Neural Networks, cont.
1960s & 1970s: progress was hindered by inflated claims and criticisms of the early work
1982: Hopfield, a Caltech physicist, mathematically tied together many of the ideas from previous research
Since then, growth has exploded, with thousands of research papers and commercial software applications
Types of Problems
• Classification: determine to which of a discrete number of classes a given input case belongs
– equivalent to logistic regression
• Regression: predict the value of a (usually) continuous variable
– equivalent to least-squares linear regression
• Time series: predict the value of variables from earlier values of the same or other variables
Basic Artificial Model
• Consists of simple processing elements called neurons, units, or nodes
• Each neuron is connected to other nodes with an associated weight (strength) which typically multiplies the signal transmitted
• Each neuron has a single threshold value
• The weighted sum of all the inputs coming into the neuron is formed and the threshold is subtracted from this value = the activation
• The activation is passed through an activation function (a.k.a. transfer function) to produce the output of the neuron (see the sketch below)
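A minimal sketch of this model in Python, assuming a sigmoid transfer function (any of the transfer functions on the following slides could be substituted); the input values, weights, and threshold are made-up numbers for illustration:

import numpy as np

def neuron_output(x, w, threshold):
    # activation = weighted sum of the inputs minus the threshold
    activation = np.dot(w, x) - threshold
    # pass the activation through a sigmoid transfer function
    return 1.0 / (1.0 + np.exp(-activation))

# example: a neuron with three inputs (illustrative values)
x = np.array([0.5, 0.8, 0.1])
w = np.array([0.9, -0.4, 0.2])
print(neuron_output(x, w, threshold=0.3))   # a value between 0 and 1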
Real vs. artificial neurons
[Figure: a biological neuron (dendrites, cell body, axon, synapse) alongside an artificial neuron with inputs x1 … xn and weights w1 … wn]
Threshold units
$o = 1$ if $\sum_{i=1}^{n} w_i x_i \geq 0$, and $o = 0$ otherwise
Processing at a Node
[Figure: the node's output plotted against the weighted sum; the activation function maps the sum to an output between 0 and 1.0, with 0.5 at the midpoint]
Transfer Functions
Determines how a neuron scales its response to incoming signals
[Figure: four common transfer functions plotted as y versus x: Hard-Limit, Sigmoid, Radial Basis, and Threshold Logic]
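A sketch of the four transfer functions in Python, assuming their usual textbook forms (a unit step, the logistic sigmoid, a Gaussian radial basis, and a clipped ramp for threshold logic), since the plots do not fix the exact formulas:

import numpy as np

def hard_limit(x):
    # step function: 0 below the threshold, 1 at or above it
    return np.where(x >= 0.0, 1.0, 0.0)

def sigmoid(x):
    # smooth, S-shaped squashing function with outputs in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def radial_basis(x):
    # Gaussian bump: largest response for inputs near 0
    return np.exp(-x ** 2)

def threshold_logic(x):
    # ramp: linear between 0 and 1, clipped outside that range
    return np.clip(x, 0.0, 1.0)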
Purpose of the Activation Function g
• We want the unit to be “active” (near +1) when the “right” inputs are given
• We want the unit to be “inactive” (near 0) when the “wrong” inputs are given
• It’s preferable for g to be nonlinear; otherwise the entire neural network collapses into a simple linear function, because a composition of linear maps is itself linear (see below)
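A one-line check of the collapse claim: with linear units (g the identity), stacking two weight layers $W_1$ and $W_2$ is equivalent to a single linear layer,

$o = W_2 (W_1 x) = (W_2 W_1)\, x = W x$, where $W = W_2 W_1$,

so extra layers would add no representational power without a nonlinear g.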
Types of connectivity
• Feedforward networks
– These compute a series of transformations
– Typically, the first layer is the input and the last layer is the output
• Recurrent networks
– These have directed cycles in their connection graph. They can have complicated dynamics.
– More biologically realistic
[Figure: a layered network of input units, hidden units, and output units]
Perceptrons
$o_i = g\left(\sum_j W_{ji}\, x_j\right)$
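A small sketch of this forward pass in Python, assuming a hard-limit g and treating the weights as a matrix W with one row per output unit (all numbers are illustrative):

import numpy as np

def perceptron_forward(W, x):
    # o_i = g(sum_j W_ji * x_j), with g a hard-limit (step) function
    return np.where(W @ x >= 0.0, 1.0, 0.0)

W = np.array([[0.5, -0.3],
              [0.2,  0.8]])   # 2 output units, 2 inputs
x = np.array([1.0, 0.4])
print(perceptron_forward(W, x))   # array([1., 1.])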
Types of Training
• Supervised: most common
– user provides a data set of inputs with corresponding output(s)
– training is performed until the network learns to associate each input vector with its desired output vector
– the relationship learned can then be applied to new data with unknown output(s)
• Unsupervised
– only input vectors are supplied
– the network learns some internal features of the whole set of input vectors presented to it
• Reinforcement learning
– a combination of the above two paradigms
Math Review
Dot product:
$\vec{v} = [v_1, v_2, \ldots, v_n]$, $\vec{w} = [w_1, w_2, \ldots, w_n]$
$\vec{v} \cdot \vec{w} = v_1 w_1 + v_2 w_2 + \ldots + v_n w_n$
Partial derivative:
$\dfrac{\partial f(v_1, v_2, \ldots, v_n)}{\partial v_i}$: hold all $v_j$ with $j \neq i$ fixed, then take the derivative.
Gradient:
$\nabla f(\vec{v}) = \left( \dfrac{\partial f}{\partial v_1}, \dfrac{\partial f}{\partial v_2}, \ldots, \dfrac{\partial f}{\partial v_n} \right)$
The gradient points in the direction of steepest ascent of the function f.
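A quick numerical illustration in Python, using the made-up function $f(v_1, v_2) = v_1^2 + 3 v_2$ just to show the dot product and a finite-difference estimate of the gradient:

import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])
print(np.dot(v, w))          # 1*4 + 2*5 + 3*6 = 32.0

def f(v):
    # illustrative function of two variables
    return v[0] ** 2 + 3.0 * v[1]

def numerical_gradient(f, v, h=1e-6):
    # estimate each partial derivative by nudging one coordinate at a time
    grad = np.zeros_like(v)
    for i in range(len(v)):
        dv = np.zeros_like(v)
        dv[i] = h
        grad[i] = (f(v + dv) - f(v - dv)) / (2 * h)
    return grad

print(numerical_gradient(f, np.array([1.0, 2.0])))   # close to [2.0, 3.0]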
Perceptron Training
• Can be trained on a set of examples using a special learning rule (process)
• Weights are changed in proportion to the difference (error) between the target output and the perceptron's output for each example p
• Minimize the summed square error function
$E = \frac{1}{2} \sum_p \sum_i \left( o_i(p) - t_i(p) \right)^2$
with respect to the weights
[Figure: a perceptron unit with inputs $x_j$, weights $w_{ji}$, and output $o_i = g(\sum_j w_{ji} x_j)$]
Perceptron Training
• The error is minimized by finding the set of weights that corresponds to the global minimum
• Done with the gradient descent method: weights are incrementally updated in proportion to $\partial E / \partial w_{ji}$
• The update reads: $w_{ji}(t+1) = w_{ji}(t) + \Delta w_{ji}$, where $\Delta w_{ji} = -\eta\, \partial E / \partial w_{ji}$ (see the sketch below)
• The aim is to produce a true mapping for all patterns
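A tiny illustration of the update rule in Python, using a made-up one-weight error function $E(w) = (w - 3)^2$ so the minimum is known to sit at $w = 3$:

def dE_dw(w):
    # derivative of the illustrative error E(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0          # initial weight
eta = 0.1        # learning rate
for step in range(50):
    delta_w = -eta * dE_dw(w)   # Delta w = -eta * dE/dw
    w = w + delta_w             # w(t+1) = w(t) + Delta w
print(w)  # approaches 3.0, the global minimum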
Why does the method work?
• For perceptrons, the error surface in weight space has a single global minimum and no local minima
• Gradient descent is therefore guaranteed to find the global minimum, provided the learning rate is not so big that you overshoot it
Math Review
Chain rule for derivatives (the chain rule can also be applied if these are partial derivatives):
$\dfrac{d f(g(x))}{dx} = \dfrac{df}{dg} \cdot \dfrac{dg}{dx}$
Derivation
$w_{ji} \leftarrow w_{ji} - \eta \dfrac{\partial E}{\partial w_{ji}}$
$E = \dfrac{1}{2} \sum_i (t_i - o_i)^2, \qquad o_i = \dfrac{1}{1 + e^{-\sum_j w_{ji} x_j}}$
$\dfrac{\partial E}{\partial w_{ji}} = \dfrac{\partial E}{\partial o_i} \cdot \dfrac{\partial o_i}{\partial w_{ji}}$
$\dfrac{\partial o_i}{\partial w_{ji}} = \dfrac{\partial o_i}{\partial \left(\sum_j w_{ji} x_j\right)} \cdot \dfrac{\partial \left(\sum_j w_{ji} x_j\right)}{\partial w_{ji}} = o_i (1 - o_i)\, x_j$
$\dfrac{\partial E}{\partial w_{ji}} = -(t_i - o_i)\, o_i (1 - o_i)\, x_j$
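A sketch in Python that checks the final derivative numerically, using made-up values for the target, inputs, and weights:

import numpy as np

t_i = 1.0                         # target output (illustrative)
x = np.array([0.5, -0.2, 0.8])    # inputs x_j (illustrative)
w = np.array([0.1,  0.4, -0.3])   # weights w_ji (illustrative)

def output(w):
    # o_i = sigmoid of the weighted sum
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def error(w):
    # E = 1/2 (t_i - o_i)^2 for this single unit
    return 0.5 * (t_i - output(w)) ** 2

o_i = output(w)
analytic = -(t_i - o_i) * o_i * (1.0 - o_i) * x          # the derived formula
h = 1e-6
numeric = np.array([(error(w + h * np.eye(3)[j]) - error(w - h * np.eye(3)[j])) / (2 * h)
                    for j in range(3)])
print(analytic, numeric)   # the two gradients agree to several decimal places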
Learning Rule for Sigmoid Unit
[Figure: a sigmoid unit with inputs $x_1 \ldots x_n$, weights $w_{ji}$, net input $net_i = \sum_j w_{ji} x_j$, and output $o_i$, with g the sigmoid]
$w_{ji} \leftarrow w_{ji} + \eta\, (t_i - o_i)\, o_i (1 - o_i)\, x_j$
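A minimal sketch applying this rule in Python to a single sigmoid unit; the AND-like training set, the constant bias input standing in for the threshold, and all hyperparameters are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])  # two inputs plus a constant bias input
t = np.array([0., 0., 0., 1.])                                          # targets: AND of the two inputs
w = rng.normal(scale=0.1, size=3)                                       # initial weights w_ji
eta = 0.5

for epoch in range(5000):
    for x_j, t_i in zip(X, t):
        o_i = 1.0 / (1.0 + np.exp(-np.dot(w, x_j)))            # sigmoid output
        w = w + eta * (t_i - o_i) * o_i * (1.0 - o_i) * x_j    # the learning rule above

for x_j in X:
    print(x_j[:2], 1.0 / (1.0 + np.exp(-np.dot(w, x_j))))      # outputs approach 0, 0, 0, 1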
Problems
• Perceptrons can only perform accurately with linearly separable classes (a linear hyperplane can place one class of objects on one side of the plane and the other class on the other)
[Figure: two classes in the (x1, x2) plane separated by a straight line]
• ANN research was put on hold for 20 years
• Solution: additional (hidden) layers of neurons, the MLP architecture
• Able to solve non-linear classification problems
[Figure: two classes in the (x1, x2) plane that cannot be separated by a single straight line]
Hidden Units
• A layer of nodes between the input and output nodes
• Allow a network to learn non-linear functions
• Allow the net to represent combinations of the input features
• With two (possibly very large) hidden layers, it is possible to implement any function
[Figure: a layered network of input units, hidden units, and output units]
Common Questions About Neural Networks
How many hidden layers should I use?
– As problem complexity increases, the number of hidden layers should also increase
– Start with none. Add hidden layers one at a time if training or testing results do not achieve target accuracy levels
Learning the Weights Between Hidden and Output
Recall the perceptron learning rule:
$W_{ji} \leftarrow W_{ji} + \eta\, (t_i - o_i)\, g'(net_i)\, x_j$
Define the error at output node i:
$\delta_i = (t_i - o_i)\, g'(net_i)$
Weight update:
$W_{ji} \leftarrow W_{ji} + \eta\, x_j\, \delta_i$
[Figure: a network with inputs $y_k$, hidden units $x_j$, and output units $o_i$]
Error Back-Propagation to Learn Weights Between Input and Hidden
• Key idea: each hidden node is responsible for some fraction of the error in each of the output nodes. This fraction equals the strength of the connection (weight) between the hidden node and the output node.
$\delta_j = \text{error at hidden node } j = g'(net_j) \sum_i W_{ji}\, \delta_i$
where $\delta_i$ is the error at output node i.
Learning Between Input and Hidden
The update rule is now the standard one, with notation adjusted to suit the situation:
$W_{kj} \leftarrow W_{kj} + \eta\, y_k\, \delta_j$
[Figure: the same network, highlighting the weights $W_{kj}$ between input units $y_k$ and hidden units $x_j$]
Summary of the BP learning algorithm
1. Initialize wji and wkj with random values.
2. Repeat until wji and wkj have converged or the desired performance level is reached:
– Pick a pattern p from the training set.
– Present the input and calculate the output.
– Update the weights according to:
wji(t + 1) = wji(t) + Δwji
wkj(t + 1) = wkj(t) + Δwkj
where Δw = -η ∂E/∂w
(…etc… for extra hidden layers)
A full worked sketch of this loop is given below.
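A self-contained sketch of the whole algorithm in Python for one hidden layer, trained on XOR (the classic non-linearly-separable example from the Problems slide); the network size, learning rate, and number of epochs are arbitrary choices for illustration, and with this setup the outputs typically settle near the XOR targets, though BP offers no guarantee of reaching the global minimum (as the next slide notes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # input patterns y_k
T = np.array([[0.], [1.], [1.], [0.]])                   # XOR targets t_i

n_in, n_hidden, n_out = 2, 4, 1
W_kj = rng.normal(scale=0.5, size=(n_in + 1, n_hidden))   # input -> hidden (last row is a bias)
W_ji = rng.normal(scale=0.5, size=(n_hidden + 1, n_out))  # hidden -> output (last row is a bias)
eta = 0.5

for epoch in range(10000):
    for y, t in zip(X, T):
        # forward pass (append a constant 1 for the bias weights)
        y_b = np.append(y, 1.0)
        x_hidden = sigmoid(y_b @ W_kj)          # hidden activations x_j
        x_b = np.append(x_hidden, 1.0)
        o = sigmoid(x_b @ W_ji)                 # output o_i

        # backward pass
        delta_i = (t - o) * o * (1.0 - o)                              # error at output nodes
        delta_j = x_hidden * (1.0 - x_hidden) * (W_ji[:-1] @ delta_i)  # error at hidden nodes

        # weight updates: W <- W + eta * (input to the layer) * delta
        W_ji += eta * np.outer(x_b, delta_i)
        W_kj += eta * np.outer(y_b, delta_j)

for y in X:
    y_b = np.append(y, 1.0)
    o = sigmoid(np.append(sigmoid(y_b @ W_kj), 1.0) @ W_ji)
    print(y, o)   # outputs should approach 0, 1, 1, 0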
Does BP always work on an MLP?
• Training of an MLP using BP can be thought of as a walk in weight space along an energy (error) surface, trying to find the global minimum while avoiding local minima
• Unlike for the perceptron, there is no guarantee that the global minimum will be reached, but in most cases the energy landscape is smooth
Other Learning Algorithms
• Back-propagation (uses the first derivative): most popular
– modifications exist: backprop with momentum, resilient backprop (Rprop)
• Others (some use the second derivative of the error surface): conjugate gradient descent, Levenberg-Marquardt, quasi-Newton algorithms…
Future reading/exploring…
• The GURU of neural nets: Geoffrey Hinton, www.cs.toronto.edu/~hinton/ (look up his excellent NN lecture notes)
• Matlab has a very good NN toolbox
• Classic textbook: Christopher M. Bishop, “Neural Networks for Pattern Recognition”
• Other useful mathematical algorithms: Kalman filters