red The Perception Learning Rule

Institute for Advanced Management Systems Research
Department of Information Technologies
Åbo Akademi University
The Perception Learning Rule - Tutorial
Robert Fullér
Directory
• Table of Contents
• Begin Article
c 2010
November 4, 2010
[email protected]
Table of Contents
1. Artificial neural networks
2. The perception learning rule
3. The perceptron learning algorithm
4. Illustration of the perceptron learning algorithm
3
1. Artificial neural networks
Artificial neural systems can be considered as simplified mathematical models of brain-like systems and they function as parallel distributed computing networks.
However, in contrast to conventional computers, which are programmed to perform specific task, most neural networks must
be taught, or trained.
They can learn new associations, new functional dependencies
and new patterns. Although computers outperform both biological and artificial neural systems for tasks based on precise
and fast arithmetic operations, artificial neural systems represent the promising new generation of information processing
networks. The study of brain-style computation has its roots
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 1: Artificial neural networks
4
over 50 years ago in the work of McCulloch and Pitts (1943)
and slightly later in Hebb’s famous Organization of Behavior
(1949).
The early work in artificial intelligence was torn between those
who believed that intelligent systems could best be built on
computers modeled after brains, and those like Minsky and Papert (1969) who believed that intelligence was fundamentally
symbol processing of the kind readily modeled on the von Neumann computer.
For a variety of reasons, the symbol-processing approach became the dominant theme in Artifcial Intelligence in the 1970s.
However, the 1980s showed a rebirth in interest in neural computing:
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 1: Artificial neural networks
5
1982 Hopfield provided the mathematical foundation for understanding the dynamics of an important class of networks.
1984 Kohonen developed unsupervised learning networks for
feature mapping into regular arrays of neurons.
1986 Rumelhart and McClelland introduced the backpropagation learning algorithm for complex, multilayer networks.
Beginning in 1986-87, many neural networks research programs
were initiated. The list of applications that can be solved by
neural networks has expanded from small test-size examples
to large practical tasks. Very-large-scale integrated neural network chips have been fabricated.
In the long term, we could expect that artificial neural systems
will be used in applications involving vision, speech, decision
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 1: Artificial neural networks
6
making, and reasoning, but also as signal processors such as
filters, detectors, and quality control systems.
Definition 1.1. Artificial neural systems, or neural networks,
are physical cellular systems which can acquire, store, and utilize experiental knowledge.
The knowledge is in the form of stable states or mappings embedded in networks that can be recalled in response to the presentation of cues.
The basic processing elements of neural networks are called
artificial neurons, or simply neurons or nodes.
Each processing unit is characterized by an activity level (representing the state of polarization of a neuron), an output value
(representing the firing rate of the neuron), a set of input conToc
JJ
II
J
I
Back
J Doc
Doc I
mappings embedded in networks that can be recalled
presentation of cues.
Section 1: Artificial
neural networks
in response
to the
Output patterns
Hidden nodes
7
Output
Hidden nodes
Input
Input patterns
Figure 1: A multi-layer feedforward neural network.
Figure
1: Aprocessing
multi-layerelements
feedforward
neural network.
The
basic
of neural
networks
4
nections, (representing synapses on the cell and its dendrite), a
bias value (representing an internal resting level of the neuron),
and a set of output connections (representing a neuron’s axonal
projections).
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 1: Artificial neural networks
8
Each of these aspects of the unit are represented mathematically by real numbers. Thus, each connection has an associated
weight (synaptic strength) which determines the effect of the
incoming input on the activation level of the unit. The weights
tive
may
be (inhibitory).
positive (excitatory) or negative (inhibitory).
x1
w1
! f
xn
"
wn
Figure 2: A processing element with single output connection.
Figure 2: A processing element with single output connection.
The signal flow from of neuron inputs, x , is co
j be
The signal flow from of neuron inputs, xj , is considered to
sidered to be
unidirectionalas
arrow
unidirectionalas
indicated
by arrows, as isindicated
a neuron’s by
output
II output
J
I
Doc
Doc I
Back
asTocis a JJ
neuron’s
signal
flow.JThe
neuron
o
Section 1: Artificial neural networks
9
signal flow. The neuron output signal is given by,
X
n
T
o = f (hw, xi) = f (w x) = f
w j xj
j=1
where w = (w1 , . . . , wn )T ∈ Rn is the weight vector. The function f (wT x) is often referred to as an activation (or transfer)
function. Its domain is the set of activation values, net, of the
neuron model, we thus often use this function as f (net). The
variable net is defined as a scalar product of the weight and
input vectors
net = hw, xi = wT x = w1 x1 + · · · + wn xn
and in the simplest case the output value o is computed as
1 if wT x ≥ θ
o = f (net) =
0 otherwise,
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 1: Artificial neural networks
10
where θ is called threshold-level and this type of node is called
a linear threshold unit.
Example 1.1. Suppose we have two Boolean inputs x1 , x2 ∈
{0, 1}, one Boolean output o ∈ {0, 1} and the training set is
given by the following input/output pairs
o(x1 , x2 ) = x1 ∧ x2
x1
x2
1.
2.
1
1
1
0
3.
0
1
0
4.
0
0
0
1
0
Then the learning problem is to find weight w1 and w2 and
threshold (or bias) value θ such that the computed output of
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 1: Artificial neural networks
11
our network (which is given by the linear threshold function) is
equal to the desired output for all examples. A straightforward
solution is w1 = w2 = 1/2, θ = 0.6. Really, from the equation
o(x1 , x2 ) =
1 if x1 /2 + x2 /2 ≥ 0.6
0 otherwise
it follows that the output neuron fires if and only if both inputs
are on.
Example 1.2. Suppose we have two Boolean inputs x1 , x2 ∈
{0, 1}, one Boolean output o ∈ {0, 1} and the training set is
given by the following input/output pairs
Toc
JJ
II
J
I
Back
J Doc
Doc I
both inputs are on.
Section 1: Artificial neural networks
12
x2
1/2x1 +1/2x 2 = 0.6
x1
A solution
the learning
problem
ofBoolean
Boolean
Figure 3: Ato
solution
to the learning
problem of
and. and.
Example 2. Suppose we have two Boolean inputs
x1 x2 o(x1 , x2 ) = x1 ∨ x2
one Boolean output o ∈ {0, 1}
x1, x2 ∈ {0, 1},
1.
2.
3.
4.
Toc
JJ
1
1
0
0
II
1
0
1
0
1
1
1
0
8
J
I
Back
J Doc
Doc I
Section 1: Artificial neural networks
13
Then the learning problem is to find weight w1 and w2 and
threshold value θ such that the computed output of our network
is equal to the desired output for all examples. A straightforward solution is w1 = w2 = 1, θ = 0.8. Really, from the
equation,
o(x1 , x2 ) =
1 if x1 + x2 ≥ 0.8
0 otherwise,
it follows that the output neuron fires if and only if at least one
of the inputs is on.
The removal of the threshold from our network is very easy by
increasing the dimension of input patterns. Really, the identity
w1 x1 + · · · + wn xn > θ ⇐⇒
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 1: Artificial neural networks
14
!
!
0
"
w1
w1
wn
xn
x1
"
wn
x1
xn
-1
Removing the threshold.
Figure 4: Removing the threshold.
The removal of the threshold from our network is
w1by
x1 increasing
+ · · · + wthe
−1×θ >
n xndimension
very easy
of 0input patterns. Really, the identity
means that by adding an extra neuron to the input layer with
+ · ·weight
· + wnxθn >
⇐⇒ of the threshold
1 x1and
fixed input value w
−1
theθ value
becomes zero. Itwis1xwhy
· +the
wnfollowing
xn − 1 × θ we
> 0suppose that the
1 + · ·in
thresholds are always equal to zero.
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 1: Artificial neural networks
15
We define now the scalar product of n-dimensional vectors,
which plays a very important role in the theory of neural networks.
Definition 1.2. Let w = (w1 , . . . , wn )T and x = (x1 , . . . , xn )T
be two vectors from Rn . The scalar (or inner) product of w and
x, denoted by < w, x > or wT x, is defined by
n
X
hw, xi = w1 x1 + · · · + wn xn =
w j xj .
j=1
Other definition of scalar product in two dimensional case is
hw, xi = kwkkxk cos(w, x)
where k.k denotes the Eucledean norm in the real plane, i.e.
q
q
kwk = w12 + w22 , kxk = x21 + x22
Toc
JJ
II
J
I
Back
J Doc
Doc I
!w! networks
= w12
Section 1: Artificial neural
+ w22, !x! =
x21 + x22
16
w
x
w2
!
x2
w1
x1
T
w = (w1, w2) and x = (x1, x2)T .
Figure 5: w = (w1 , w2 )T and x = (x1 , x2 )T .
Lemma 1. The following property holds
Lemma 1. The following property holds
hw, xi =<ww,
w2wx12x1 + w2x2
>=
1 x1x +
q
q
2
2
= w1 + w2 x21 + x22 cos(w, x)
Toc
JJ
!
!
kwkkxk
2 + w 2 cos(w,
2 cos(w, x)
==w
x21 + xx).
1
2
2
II
J
I
Back
J Doc
Doc I
Section 1: Artificial neural networks
17
Proof
cos(w, x) = cos((w, 1-st axis) − (x, 1-st axis))
That is,
= cos((w, 1-st axis) cos(x, 1-st axis))+
sin(w, 1-st axis) sin(x, 1-st axis) =
w 1 x1
w2 x2
p
p
p
p
+
w12 + w22 x21 + x22
w12 + w22 x21 + x22
q
q
2
2
kwkkxk cos(w, x) = w1 + w2 x21 + x22 cos(w, x)
= w 1 x 1 + w 2 x2 .
From cos π/2 = 0 it follows that hw, xi = 0 whenever w and x
are perpendicular.
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 1: Artificial neural networks
18
Then, the actual output is compared with the deIf kwksired
= 1 target,
(we sayand
thatawmatch
is normalized)
thenIf|hw,
is computed.
thexi|
out-is nothing else
but
the
projection
of
x
onto
the
direction
of
put and target match, no change is made to w.
theReally,
if kwknet.
= 1However,
then we get
if the output differs from the targethw,
a change
must
madex)to=some
of the x)
conxi = kwkkxkbe
cos(w,
kxk cos(w,
nections.
w
<w,x>
x
Projection of x onto the direction of w.
Figure 6: Projection of x onto the direction of w.
The perceptron
learning
rule,networks
introduced
by RosenThe problem
of learning
in neural
is simply
the probblatt,
is
a
typical
error
correction
learning
algorithm
lem of finding a set of connection strengths (weights) which
of single-layer feedforward networks with linear threshToc
JJ
II
J
I
Back
J Doc
Doc I
19
allow the network to carry out the desired computation. The
network is provided with a set of example input/output pairs (a
training set) and is to modify its connections in order to approximate the function from which the input/output pairs have been
drawn. The networks are then tested for ability to generalize.
2. The perception learning rule
The error correction learning procedure is simple enough in
conception. The procedure is as follows: During training an
input is put into the network and flows through the network
generating a set of values on the output units. Then, the actual output is compared with the desired target, and a match is
computed. If the output and target match, no change is made to
the net. However, if the output differs from the target a change
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 2: The perception learning rule
20
must be made to some of the connections.
The perceptron learning rule, introduced by Rosenblatt, is a
typical error correction learning algorithm of single-layer feedforward networks with linear threshold activation function.
!m
!1
w11
wmn
x1
x2
xn
Single-layer feedforward network.
Figure 7: Single-layer feedforward network.
JJ
II
Jthe weight
I
Doc j-th
Doc I
Toc
Back
from Jthe
input
Usually,
w denotes
Section 2: The perception learning rule
21
Usually, wij denotes the weight from the j-th input unit to the
i-th output unit and wi denotes the weight vector of the i-th
output node. We are given a training set of input/output pairs,
No.
input values
desired output values
1.
..
.
x1 = (x11 , . . . x1n )
..
.
1
y 1 = (y11 , . . . , ym
)
..
.
K.
K
xK = (xK
1 , . . . xn )
K
y K = (y1K , . . . , ym
)
Our problem is to find weight vectors wi such that
oi (xk ) = sign(< wi , xk >) = yik , i = 1, . . . , m
for all training patterns k.
The activation function of the output nodes is linear threshold
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 2: The perception learning rule
22
function of the form
1 if < wi , x > ≥ 0
0 if < wi , x > < 0
and the weight adjustments in the perceptron learning method
are performed by
oi (x) =
for i = 1, . . . , m.
wi := wi + η(yi − oi )x,
That is,
wij := wij + η(yi − oi )xj ,
for j = 1, . . . , n, where η > 0 is the learning rate.
From this equation it follows that if the desired output is equal
to the computed output, yi = oi , then the weight vector of the
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 2: The perception learning rule
23
i-th output node remains unchanged, i.e. wi is adjusted if and
only if the computed output, oi , is incorrect. The learning stops
when all the weight vectors remain unchanged during a complete training cycle.
Consider now a single-layer network with one output node. We
are given a training set of input/output pairs
No.
input values
desired output values
1.
..
.
x1 = (x11 , . . . x1n )
..
.
y1
..
.
K.
K
xK = (xK
1 , . . . xn )
yK
Then the input components of the training patterns can be clasToc
JJ
II
J
I
Back
J Doc
Doc I
Section 2: The perception learning rule
24
sified into two disjunct classes
C1 = {xk |y k = 1},
C2 = {xk |y k = 0}
i.e. x belongs to class C1 if there exists an input/output pair
(x, 1) and x belongs to class C2 if there exists an input/output
pair (x, 0).
x1
w1
f
xn
! = f(net)
wn
AFigure
simple
processing element.
8: A simple processing element.
Taking into consideration the definition of the activation func-
Taking
into
consideration
the
definition
of the actiJJ
II
J
I
J Doc Doc I
Toc
Back
Section 2: The perception learning rule
25
tion it is easy to see that we are searching for a weight vector w
such that
hw, xi > 0 for each x ∈ C1
and
hw, xi < 0 for each x ∈ C2 .
If such vector exists then the problem is called linearly separable.
If the problem is linearly separable then we can always suppose
that the line separating the classes crosses the origin.
And the problem can be further transformed to the problem of
finding a line, for which all the elements of C1 ∪ (−C2 ) can be
found on the positive half-plane.
So, we search for a weight vector w such that hw, xi > 0 for
each x ∈ C1 , and hw, xi < 0 for each x ∈ C2 . That is, hw, xh>
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 2: The perception learning rule
26
x2
C1
w
x1
C2
A two-class linearly separable classification problem.
Figure 9: A two-class linearly separable classification problem.
If the problem is linearly separable then we can al0 for ways
each xsuppose
∈ C1 and
hw,the
−xi
> separating
0 for each the
− xclasses
∈ C2 . Which
that
line
can be
written
in
the
form,
hw,
xi
>
for
each
x
∈
C1 ∪ −C2 .
crosses the origin.
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 2: The perception learning rule
27
x2
C1
w
x1
C2
Shifting the origin.
Figure 10: Shifting the origin.
And the problem can be further transformed to the
And theproblem
weight ofadjustment
in the
perceptron
finding a line,
for which
all thelearning
elements rule is
performed
by
of C
∪
(−C
)
can
be
found
on
the
positive
half1
2
plane.
wj := wj + η(y − o)xj .
where η > 0 is the learning rate, y = 1 is he desired output,
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 2: The perception learning rule
28
x2
-C 2
C1
w
x1
C1 ∪ (−C2) is on the postive side of the line.
Figure 11: C1 ∪ (−C2 ) is on the postive side of the line.
So, we are searching for a weight vector w such that
o ∈ {0, 1} is the<computed
x is
theC actual
input to the
w, x > > output,
0 for each
x∈
1,
neuron.
That is,
Toc
JJ
< w, x > < 0 for each x ∈ C2.
II
J
I
Back
J Doc
Doc I
Section 2: The perception learning rule
29
x
!x
w'
new hyperplane
w
0
old hyperplane
Illustration of the perceptron learning algorithm.
Figure 12: Illustration of the perceptron learning algorithm.
Summary 1. The perceptron learning algorithm: Given
are K training pairs arranged in the training set
Toc
JJ
II
J
I
Back
J Doc
Doc I
30
3. The perceptron learning algorithm
Given are K training pairs arranged in the training set
(x1 , y 1 ), . . . , (xK , y K )
k
where xk = (xk1 , . . . , xkn ), y k = (y1k , . . . , ym
), k = 1, . . . , K.
• Step 1 η > 0 is chosen
• Step 2 Weigts wi are initialized at small random values,
the running error E is set to 0, k := 1
• Step 3 Training starts here. xk is presented, x := xk ,
y := y k and output o = o(x) is computed
oi =
Toc
JJ
II
1 if < wi , x > > 0
0 if < wi , x > < 0
J
I
Back
J Doc
Doc I
Section 3: The perceptron learning algorithm
31
• Step 4 Weights are updated
wi := wi + η(yi − oi )x, i = 1, . . . , m
• Step 5 Cumulative cycle error is computed by adding the
present error to E
1
E := E + ky − ok2
2
• Step 6 If k < K then k := k + 1 and we continue the
training by going back to Step 3, otherwise we go to Step
7
• Step 7 The training cycle is completed. For E = 0 terminate the training session. If E > 0 then E is set to 0,
k := 1 and we initiate a new training cycle by going to
Step 3
Toc
JJ
II
J
I
Back
J Doc
Doc I
32
The following theorem shows that if the problem has solutions
then the perceptron learning algorithm will find one of them.
Theorem 3.1. (Convergence theorem) If the problem is linearly
separable then the program will go to Step 3 only finetely many
times.
4. Illustration of the perceptron learning algorithm
Consider the following training set
No.
1.
2.
3.
Toc
input values
x1 = (1, 0, 1)T
x2 = (0, −1, −1)T
x3 = (−1, −0.5, −1)T
JJ
II
J
desired output value
-1
1
1
I
Back
J Doc
Doc I
Section 4: Illustration of the perceptron learning algorithm
33
The learning constant is assumed to be 0.1. The initial weight
vector is w0 = (1, −1, 0)T .
Then the learning according to the perceptron learning rule progresses as follows.
• [Step 1] Input x1 , desired output is -1:


1
< w0 , x1 >= (1, −1, 0)  0  = 1
1
Correction in this step is needed since
y1 = −1 6= sign(1).
We thus obtain the updated vector
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 4: Illustration of the perceptron learning algorithm
34
w1 = w0 + 0.1(−1 − 1)x1
Plugging in numerical values we obtain


  

1
1
0.8
w1 =  −1  − 0.2  0  =  −1 
0
1
−0.2
• [Step 2] Input is x2 , desired output is 1. For the present
w1 we compute the activation value

0
< w1 , x2 >= (0.8, −1, −0.2)  −1  = 1.2
−1

Correction is not performed in this step since 1 = sign(1.2),
so we let w2 := w1 .
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 4: Illustration of the perceptron learning algorithm
35
• [Step 3] Input is x3 , desired output is 1.


−1
< w2 , x3 >= (0.8, −1, −0.2)  −0.5  = −0.1
−1
Correction in this step is needed since y3 = 1 6= sign(−0.1).
We thus obtain the updated vector
w3 = w2 + 0.1(1 + 1)x3
Plugging in numerical values we obtain

Toc


 

0.8
−1
0.6
w3 =  −1  + 0.2  −0.5  =  −1.1 
−0.2
−1
−0.4
JJ
II
J
I
Back
J Doc
Doc I
Section 4: Illustration of the perceptron learning algorithm
36
• [Step 4] Input x1 , desired output is -1:


1
< w3 , x1 >= (0.6, −1.1, −0.4)  0  = 0.2
1
Correction in this step is needed since
y1 = −1 6= sign(0.2).
We thus obtain the updated vector
w4 = w3 + 0.1(−1 − 1)x1
Plugging in numerical values we obtain

  

0.6
1
0.4
w4 =  −1.1  − 0.2  0  =  −1.1 
−0.4
1
−0.6

Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 4: Illustration of the perceptron learning algorithm
37
• [Step 5] Input is x2 , desired output is 1. For the present
w4 we compute the activation value


0
< w4 , x2 >= (0.4, −1.1, −0.6)  −1  = 1.7
−1
Correction is not performed in this step since 1 = sign(1.7),
so we let w5 := w4 .
• [Step 6] Input is x3 , desired output is 1.


−1
< w5 , x3 >= (0.4, −1.1, −0.6)  −0.5  = 0.75
−1
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 4: Illustration of the perceptron learning algorithm
38
Correction is not performed in this step since 1 = sign(0.75),
so we let w6 := w5 .
This terminates the learning process, because
< w6 , x1 >= −0.2 < 0,
< w6 , x2 >= 1.70 > 0,
< w6 , x3 >= 0.75 > 0.
Minsky and Papert (1969) provided a very careful analysis of
conditions under which the perceptron learning rule is capable
of carrying out the required mappings.
They showed that the perceptron can not succesfully solve the
problem
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 4: Illustration of the perceptron learning algorithm
1.
2.
3.
4.
39
x1
x1
o(x1 , x2 )
1
1
0
0
1
0
1
0
0
1
1
0
This Boolean function is know in the literature az exclusive
(XOR).
Sometimes we call the above function as two-dimensional parity function.
The n-dimensional parity function is a binary Boolean function,
which takes the value 1 if we have odd number of 1-s in the
input vector, and zero otherwise.
Toc
JJ
II
J
I
Back
J Doc
Doc I
exclusive
(XOR).
Section 4: Illustration
of the perceptron learning algorithm
40
Figure 13: Illustration of exclusive or.
Sometimes we call the above function as two-dimensi
For example,
the following function is the 3-dimensional parity
parity
function.
function
The n-dimensional parity function is a binary Boolean
function, which takes the value 1 if we have odd
Toc
JJ
II
J
I
Back
J Doc
Doc I
Section 4: Illustration of the perceptron learning algorithm
1.
2.
3.
4.
5.
6.
7.
8.
Toc
JJ
x1
1
1
1
1
0
0
0
0
II
x1
1
1
0
0
0
1
1
0
J
x3
1
0
1
0
1
1
0
0
41
o(x1 , x2 , x3 )
1
0
0
1
1
0
1
0
I
Back
J Doc
Doc I