
Overview over different methods – Supervised Learning

[Overview diagram of Machine Learning methods, organized by feedback type.
Evaluative feedback (rewards): Reinforcement Learning (example based) – Dynamic Programming (Bellman Eq.), Monte Carlo Control, TD(λ) (often λ=0; TD(1), TD(0)), SARSA, Q-Learning, Actor/Critic ("Critic"), Neur. TD-formalism / Neur. TD-Models (technical & Basal Ganglia), Neuronal Reward Systems (Basal Ganglia; Dopamine, Glutamate).
Non-evaluative feedback (correlations): Un-supervised Learning (correlation based) – Hebb-Rule, δ-Rule (supervised L.) = Rescorla/Wagner, LTP (LTD = anti), Eligibility Traces, Differential Hebb-Rule ("slow" = Neur. TD-formalism; "fast"), STDP-Models, ISO-Learning, ISO-Model of STDP, ISO-Control, Biophysics of Synaptic Plasticity, STDP (biophysical & network), Correlation based Control (non-evaluative).
Related fields: Classical Conditioning (anticipatory control of actions and prediction of values), Synaptic Plasticity (correlation of signals). And many more.]
Some more basics:

Threshold Logic Unit (TLU)

[Diagram: inputs u1, u2, …, un with weights w1, w2, …, wn feeding a unit with threshold θ and output v.]

activation: a = Σi=1..n wi ui
output: v = 1 if a ≥ θ, 0 if a < θ
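The unit above can be written down directly; here is a minimal Python sketch (function and variable names are illustrative, not from the slides):

```python
# Minimal sketch of a Threshold Logic Unit (TLU).
def tlu(u, w, theta):
    """Fire (1) if the weighted sum of the inputs reaches the threshold."""
    a = sum(wi * ui for wi, ui in zip(w, u))  # activation a = sum_i w_i * u_i
    return 1 if a >= theta else 0

# Example: two inputs, unit weights, threshold 1.5 (this realizes logical AND).
print(tlu([1, 1], [1.0, 1.0], 1.5))  # 1
print(tlu([1, 0], [1.0, 1.0], 1.5))  # 0
```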
Activation Functions

[Plots of output v against activation a for four choices: threshold, piece-wise linear, linear, sigmoid.]
Decision Surface of a TLU
1
>q
1
1
u2
1
Decision line
w1 u1 + w2 u2 = q
0
0
0
0
<q
0
u1
Scalar Products & Projections

[Diagrams: vectors w and u enclosing the angle φ; depending on the angle, w•u > 0, w•u = 0 or w•u < 0, i.e. the sign of the scalar product tells on which side of the line perpendicular to w the vector u lies.]

w • u = |w||u| cos φ
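This identity is easy to check numerically; a small Python sketch (the example vectors are illustrative), computing the angle independently of the dot product:

```python
import math

def dot(w, u):
    """Scalar product w . u = sum_i w_i * u_i."""
    return sum(wi * ui for wi, ui in zip(w, u))

w, u = [3.0, 4.0], [4.0, 3.0]
# Angle between the two vectors, obtained without using the dot product:
phi = math.atan2(u[1], u[0]) - math.atan2(w[1], w[0])
lhs = dot(w, u)                                      # 24.0
rhs = math.hypot(*w) * math.hypot(*u) * math.cos(phi)
assert abs(lhs - rhs) < 1e-9  # w . u = |w||u| cos(phi)
# The sign of w . u tells on which side of the line w . x = 0 the point u lies.
```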
Geometric Interpretation

The relation w•u = θ implicitly defines the decision line.

[Plot: decision line w1 u1 + w2 u2 = θ, i.e. w•u = θ, perpendicular to the weight vector w; v=1 on the side w points to, v=0 on the other. The projection uw of any point u on the line onto w has length |uw| = θ/|w|.]
Geometric Interpretation

• In n dimensions the relation w•u = θ defines an (n-1)-dimensional hyper-plane, which is perpendicular to the weight vector w.
• On one side of the hyper-plane (w•u > θ) all patterns are classified by the TLU as "1", while those that get classified as "0" lie on the other side of the hyper-plane.
• If patterns cannot be separated by a hyper-plane, then they cannot be correctly classified with a TLU.
Linear Separability

Logical AND: separable, e.g. with w1=1, w2=1, θ=1.5.

u1  u2 | a | v
 0   0 | 0 | 0
 0   1 | 1 | 0
 1   0 | 1 | 0
 1   1 | 2 | 1

Logical XOR: w1=?, w2=?, θ=? – no choice works; the classes cannot be separated by a single line.

u1  u2 | v
 0   0 | 0
 0   1 | 1
 1   0 | 1
 1   1 | 0

[Plots in the (u1, u2) plane showing the four patterns for AND (separable by a line) and for XOR (not separable).]
Threshold as Weight

The threshold can be treated as an extra weight by adding a constant input un+1 = -1 with weight wn+1 = θ:

a = Σi=1..n+1 wi ui
v = 1 if a ≥ 0, 0 if a < 0
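The same trick in code; a minimal Python sketch (names illustrative), folding the threshold into an augmented weight vector:

```python
def tlu_aug(u, w_aug):
    """TLU with the threshold folded in as weight w_{n+1} on input u_{n+1} = -1."""
    a = sum(wi * ui for wi, ui in zip(w_aug, list(u) + [-1.0]))
    return 1 if a >= 0 else 0

# Logical AND with w1=w2=1, theta=1.5 becomes w_aug = [1, 1, 1.5]:
assert tlu_aug([1, 1], [1.0, 1.0, 1.5]) == 1
assert tlu_aug([0, 1], [1.0, 1.0, 1.5]) == 0
```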
Geometric Interpretation

The relation w•u = 0 defines the decision line.

[Plot: decision line w•u = 0 through the origin of the augmented input space, perpendicular to w; v=1 on the side w points to, v=0 on the other.]
Training ANNs

• Training set S of examples {u, vt}
  • u is an input vector and vt the desired target output
  • Example: Logical AND
    S = {((0,0),0), ((0,1),0), ((1,0),0), ((1,1),1)}
• Iterative process
  • Present a training example u, compute the network output v, compare the output v with the target vt, adjust weights and thresholds
• Learning rule
  • Specifies how to change the weights w and thresholds θ of the network as a function of the inputs u, output v and target vt.
Adjusting the Weight Vector
u
u
w’ = w + mu
mu
j>90
vt
Target =1
Output v=0
w
w
Move w in the direction of u
u
u
w
-mu
j<90
vt
Target =0
Output v=1
w
w’ = w - mu
Move w away from the direction of u
Perceptron Learning Rule

w' = w + μ (vt-v) u

Or in components:

w'i = wi + Δwi = wi + μ (vt-v) ui   (i=1..n+1)

with wn+1 = θ and un+1 = -1.

• The parameter μ is called the learning rate. It determines the magnitude of the weight updates Δwi.
• If the output is correct (vt = v) the weights are not changed (Δwi = 0).
• If the output is incorrect (vt ≠ v) the weights wi are changed such that the new weight vector w' moves closer to the input u (if vt=1) or further from it (if vt=0).
Perceptron Training Algorithm

repeat
  for each training vector pair (u, vt)
    evaluate the output v when u is the input
    if v ≠ vt then
      form a new weight vector w' according to w' = w + μ (vt-v) u
    else
      do nothing
    end if
  end for
until v = vt for all training vector pairs
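The algorithm above translates directly into a short Python sketch (names and the epoch cap are illustrative), using the threshold-as-weight trick with un+1 = -1:

```python
def train_perceptron(data, mu=0.2, epochs=50):
    """Perceptron rule w' = w + mu*(vt - v)*u, with the threshold as an extra weight."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)                      # last entry plays the role of theta
    for _ in range(epochs):
        converged = True
        for u, vt in data:
            x = list(u) + [-1.0]             # augmented input u_{n+1} = -1
            v = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            if v != vt:
                w = [wi + mu * (vt - v) * xi for wi, xi in zip(w, x)]
                converged = False
        if converged:                        # v = vt for all training pairs
            break
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(AND)                    # learns a separating line for AND
```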
Perceptron Convergence Theorem

The algorithm converges to the correct classification
• if the training data is linearly separable
• and μ is sufficiently small.

If two classes of vectors {u1} and {u2} are linearly separable, the application of the perceptron training algorithm will eventually result in a weight vector w0, such that w0 defines a TLU whose decision hyper-plane separates {u1} and {u2} (Rosenblatt 1962).

The solution w0 is not unique, since if w0•u = 0 defines a hyper-plane, so does w'0 = k·w0.
Generalized Perceptron Learning Rule

If we do not include the threshold as an input, we use the following description of the perceptron with symmetrical outputs (this does not matter much, though):

v = +1 if w•u - θ ≥ 0
v = -1 if w•u - θ < 0

Then we get for the learning rule:

w → w + (μ/2) (vt - v) u   and   θ → θ - (μ/2) (vt - v)

This implies:

(w•u - θ) → (w•u - θ) + (μ/2) (vt - v) (|u|² + 1)

Hence, if vt=1 and v=-1 the weight change increases the term w•u - θ, and vice versa. This is what we need to compensate the error!
Linear Unit – no Threshold!

[Diagram: inputs u1, u2, …, un with weights w1, w2, …, wn; the activation is passed through unchanged.]

activation: a = Σi=1..n wi ui
output: v = a = Σi=1..n wi ui

Let's abbreviate the target output (vectors) by t in the next slides.
Gradient Descent Learning Rule

• Consider a linear unit without threshold and with continuous output v (not just -1, 1):
  v = w0 + w1 u1 + … + wn un
• Train the wi's such that they minimize the squared error
  E[w1,…,wn] = ½ Σd∈D (td - vd)²
  where D is the set of training examples and t the target outputs.
Gradient Descent

D = {<(1,1),1>, <(-1,-1),1>, <(1,-1),-1>, <(-1,1),-1>}

Gradient: ∇E[w] = [∂E/∂w0, …, ∂E/∂wn]

Δw = -μ ∇E[w]
Δwi = -μ ∂E/∂wi
    = -μ ∂/∂wi ½ Σd (td - vd)²
    = -μ ∂/∂wi ½ Σd (td - Σi wi ui)²
    = -μ Σd (td - vd)(-ui)
    = μ Σd (td - vd) ui

[Plot: error surface over the weight space; one gradient step moves from (w1, w2) to (w1+Δw1, w2+Δw2).]
Gradient Descent

Gradient-Descent(training_examples, μ)

Each training example is a pair of the form {(u1,…,un), t}, where (u1,…,un) is the vector of input values, t is the target output value, and μ is the learning rate (e.g. 0.1).

• Initialize each wi to some small random value
• Until the termination condition is met, Do
  • Initialize each Δwi to zero
  • For each {(u1,…,un), t} in training_examples Do
    • Input the instance (u1,…,un) to the linear unit and compute the output v
    • For each linear unit weight wi Do
      Δwi = Δwi + μ (t-v) ui
  • For each linear unit weight wi Do
    wi = wi + Δwi
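A minimal Python sketch of this batch procedure (the data set and the fixed epoch count are illustrative assumptions; the data follows the linear rule t = 2·u1 - u2):

```python
import random

def gradient_descent(examples, mu=0.05, epochs=500):
    """Batch gradient descent for a linear unit v = sum_i w_i * u_i."""
    n = len(examples[0][0])
    w = [random.uniform(-0.1, 0.1) for _ in range(n)]  # small random init
    for _ in range(epochs):
        dw = [0.0] * n                        # initialize each Dw_i to zero
        for u, t in examples:
            v = sum(wi * ui for wi, ui in zip(w, u))
            for i in range(n):
                dw[i] += mu * (t - v) * u[i]  # Dw_i <- Dw_i + mu*(t - v)*u_i
        w = [wi + dwi for wi, dwi in zip(w, dw)]  # update once per epoch
    return w

# Illustrative data generated by the linear rule t = 2*u1 - u2:
D = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1), ((2, 1), 3)]
w = gradient_descent(D)   # w approaches [2, -1]
```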
Incremental Stochastic Gradient Descent

• Batch mode: gradient descent
  w = w - μ ∇ED[w] over the entire data D
  ED[w] = ½ Σd (td - vd)²
• Incremental (stochastic) mode: gradient descent
  w = w - μ ∇Ed[w] over individual training examples d
  Ed[w] = ½ (td - vd)²

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if μ is small enough.
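The incremental mode differs from the batch sketch only in where the update happens; a Python sketch under the same illustrative assumptions (data generated by t = 2·u1 - u2, fixed epoch count):

```python
def stochastic_gradient_descent(examples, mu=0.02, epochs=500):
    """Incremental (stochastic) mode: update w after every single example d."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for u, t in examples:                 # one weight update per example
            v = sum(wi * ui for wi, ui in zip(w, u))
            w = [wi + mu * (t - v) * ui for wi, ui in zip(w, u)]
    return w

D = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1), ((2, 1), 3)]  # t = 2*u1 - u2
w = stochastic_gradient_descent(D)  # with small mu, close to the batch result
```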
Perceptron vs. Gradient Descent Rule

• perceptron rule
  w'i = wi + μ (tp - vp) uip
  derived from manipulation of the decision surface.
• gradient descent rule
  w'i = wi + μ (tp - vp) uip
  derived from minimization of the error function
  E[w1,…,wn] = ½ Σp (tp - vp)²
  by means of gradient descent.
Perceptron vs. Gradient Descent Rule

The perceptron learning rule is guaranteed to succeed if
• the training examples are linearly separable
• and the learning rate μ is sufficiently small.

Linear unit training rules using gradient descent are
• guaranteed to converge to the hypothesis with minimum squared error
• given a sufficiently small learning rate μ
• even when the training data contains noise
• even when the training data is not separable by a hyper-plane.
Presentation of Training Examples

• Presenting all training examples once to the ANN is called an epoch.
• In incremental stochastic gradient descent the training examples can be presented in
  • fixed order (1, 2, 3, …, M)
  • randomly permutated order (5, 2, 7, …, 3)
  • completely random order (4, 1, 7, 1, 5, 4, …)
Neuron with Sigmoid-Function

[Diagram: inputs x1, x2, …, xn with weights w1, w2, …, wn.]

activation: a = Σi=1..n wi xi
output: y = σ(a) = 1/(1+e^-a)
Sigmoid Unit

[Diagram: inputs u1, …, un with weights w1, …, wn, plus a bias input x0 = -1 with weight w0.]

a = Σi=0..n wi ui
v = σ(a) = 1/(1+e^-a)

σ(x) is the sigmoid function: 1/(1+e^-x)
dσ(x)/dx = σ(x) (1 - σ(x))

Derive gradient descent rules to train:
• one sigmoid unit:
  ∂E/∂wi = -Σp (tp - vp) vp (1 - vp) uip
• multilayer networks of sigmoid units → backpropagation.
Gradient Descent Rule for Sigmoid Output Function

Ep[w1,…,wn] = ½ (tp - vp)²

∂Ep/∂wi = ∂/∂wi ½ (tp - vp)²
        = ∂/∂wi ½ (tp - σ(Σi wi uip))²
        = (tp - vp) σ'(Σi wi uip) (-uip)

For v = σ(a) = 1/(1+e^-a):
σ'(a) = e^-a/(1+e^-a)² = σ(a) (1 - σ(a))

w'i = wi + Δwi = wi + μ v(1-v) (tp - vp) uip
Gradient Descent Learning Rule
vj
wji
ui
Dwi = m vjp(1-vjp) (tjp-vjp) uip
learning rate
derivative of
activation function
activation of
pre-synaptic neuron
error dj of
post-synaptic neuron
ALVINN

Automated driving at 70 mph on a public highway.

[Network diagram: a 30x32-pixel camera image as inputs, 4 hidden units, 30 outputs for steering; 30x32 weights feed into one out of the four hidden units.]