Ch 3. Neural Networks

INTRODUCTION TO NEURAL NETWORKS AND LEARNING
What is a Neural Network?

- Computational model inspired by biological neurons, the cells that perform information processing in the brain
- Systems are trained rather than programmed to accomplish tasks
- Successful at solving problems that have proven difficult or impossible for conventional computing techniques
- Formal definition: A neural network is a non-programmed, adaptive information processing system based upon research into how the brain encodes and processes information
Biological Neuron

[Figure: schematic of a biological neuron, excerpted from Artificial Intelligence: A Modern Approach by S. Russell and P. Norvig]
Properties of Biological Neurons and the Brain

- Neurotransmission
  - By means of electrical impulses effected by chemical transmitters
- Period of latent summation
  - The neuron generates an impulse ("firing") if the total membrane potential reaches a threshold level
  - Inputs may be excitatory or inhibitory
Human Brain

- Cerebral cortex
  - Areas of the brain dedicated to specific functions
  - $10^{11}$ neurons
  - $10^{4}$ synapses per neuron
  - $10^{-3}$ sec cycle time (computer: $10^{-8}$ sec)
Computational Model of a Neuron

[Figure: dendrites (input) feed through weighted synapses $w_i$ into the soma, whose summer computes $s(X) = \sum_{i=1}^{N} w_i x_i$; a 0/1 threshold unit $f_s(s)$ then drives the axon (output).]
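Below is a minimal sketch of this model in Python (NumPy); the input values, weights, and threshold are illustrative choices, not from the slide.

```python
import numpy as np

def neuron(x, w, theta):
    """The model above: a weighted sum over the synapses followed by
    a 0/1 threshold, i.e., s(X) = sum_i w_i * x_i, output f_s(s)."""
    s = np.dot(w, x)               # the summer
    return 1 if s >= theta else 0  # the threshold unit

# Illustrative values only: three inputs, arbitrary weights, threshold 1.0.
print(neuron(np.array([1.0, 0.5, -1.0]), np.array([0.8, 0.4, 0.3]), theta=1.0))  # -> 0
```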
What Makes Up a Neural Network?

Neural Net              | Neurobiology
------------------------|---------------------
Processing element      | Neuron
Interconnection scheme  | Dendrites and axons
Learning law            | Neurotransmitters
Neural Net vs. Von Neumann Computer

Neural Net                                         | Von Neumann
---------------------------------------------------|-----------------------------------------------------
Non-algorithmic                                    | Algorithmic
Trained                                            | Programmed with instructions
Memory and processing elements the same            | Memory and processing separate
Pursues multiple hypotheses simultaneously         | Pursues one hypothesis at a time
Fault tolerant                                     | Not fault tolerant
Non-logical operation                              | Highly logical operation
Adaptation or learning                             | Algorithmic parameter modification only
Seeks answers by finding minima in solution space  | Seeks answers by following a logical tree structure
What is Learning?

- Concepts of inductive learning (learning from examples)
  - Example: a pair $(x, f(x))$
  - Inductive inference
    - Hypothesis $h$: from a collection of examples of $(x, f(x))$, produce $h(x)$, an approximation of $f$; $h$ is the agent's belief about $f$ ($h \approx f$)
    - Hypothesis space: the set of all hypotheses that can be written
- Bias
  - Any preference for one hypothesis over another
  - Needed because there are many consistent hypotheses
Examples of Biased Hypotheses

[Figure: (a) a set of examples; (b)-(d) three different hypotheses, each consistent with the same examples.]
Representation of Functions

- Expressiveness: a perceptron cannot learn XOR
- Efficiency: the number of examples needed for good generalization
- The aim is 'a good set of sentences'
Learning Procedure

1. Collect a large set of examples.
2. Divide it into two disjoint sets: the training set and the test set.
3. Run the learning algorithm with the training set as examples to generate a hypothesis H.
4. Measure the percentage of examples in the test set that are correctly classified by H.
5. Repeat steps 1-4 for different sizes of training sets and different randomly selected training sets of each size.
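A compact sketch of steps 1-5 in Python; the `learn` callback, the 80/20 split, and the trial count are illustrative assumptions.

```python
import random

def evaluate(learn, examples, train_fraction=0.8, trials=10):
    """Repeatedly split the examples into disjoint training/test sets,
    learn a hypothesis H on the training set, and measure test accuracy.
    `learn` maps a list of (x, fx) pairs to a hypothesis h, with h(x) -> prediction."""
    accuracies = []
    for _ in range(trials):
        shuffled = random.sample(examples, len(examples))  # fresh random split per trial
        cut = int(train_fraction * len(shuffled))
        train, test = shuffled[:cut], shuffled[cut:]
        h = learn(train)                                   # step 3: generate hypothesis H
        correct = sum(1 for x, fx in test if h(x) == fx)   # step 4: score on the test set
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```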
SINGLE THRESHOLD LOGIC UNIT (TLU)

Introduction: S-R Learning with a TLU/NN

- Experiences: a set of sensory data, each paired with the proper action
  $E = \{(X, a) \mid X \in \Xi,\ a \in \{0, 1\}\}$, where $\Xi = \{X \mid X = (x_1, \ldots, x_i, \ldots, x_n)\}$
- Knowledge: a function from sensory data to the proper action, $f : X \mapsto a$
- Representation:
  1. a single TLU with adjustable weights
  2. a multilayer perceptron
Training a Single TLU

- TLU geometry
- TLU definition
  - An internal knowledge representation
  - An abstract computational tool that calculates $f(X) = f_s(W \cdot X - \theta)$
    - Input: $X$
    - Output: $f(X)$
    - Transfer function: $f_s(s)$, a 0/1 threshold (step) function
    - Weights: $W$
    - Threshold: $\theta$
- Geometric interpretation of the TLU computation
  - Output is 0 if $X$ is on one side of a hyperplane, 1 if $X$ is on the other side
  - The hyperplane in $R^n$ space: $W \cdot X - \theta = 0$
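A short sketch of the geometric interpretation; the weights, threshold, and test points below are hypothetical.

```python
import numpy as np

def tlu(X, W, theta):
    """TLU output: 1 on one side of the hyperplane W . X - theta = 0, else 0."""
    return 1 if np.dot(W, X) - theta > 0 else 0

# Hypothetical example in R^2: W = (1, 1), theta = 1.5 defines the line x1 + x2 = 1.5.
W, theta = np.array([1.0, 1.0]), 1.5
print(tlu(np.array([2.0, 1.0]), W, theta))  # point above the line -> 1
print(tlu(np.array([0.5, 0.5]), W, theta))  # point below the line -> 0
```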
Training a Single TLU: Gradient Descent Method

- Definition
  $s = W \cdot X - \theta = (w_1, w_2, \ldots, w_n) \cdot (x_1, x_2, \ldots, x_n) - \theta$
  $\phantom{s} = W' \cdot X' = (w_1, w_2, \ldots, w_n, -\theta) \cdot (x_1, x_2, \ldots, x_n, 1)$
  (the threshold is absorbed as an extra weight on a constant input of 1)
- Gradient descent methods
  - Learning is a search over internal representations which maximizes/minimizes some evaluation function:
    $\varepsilon = \sum_i (d_i - f_i)^2$
    - TLU output: $f_i = f_s(W \cdot X_i)$
    - Desired output: $d_i$
  - $\varepsilon$ can be minimized by gradient descent: a greedy method
- How to update W?
  - Incremental learning: adjust W so as to slightly reduce $\varepsilon$ for one $X_i$
  - Batch learning: adjust W so as to reduce $\varepsilon$ for all $X_i$
  $W \leftarrow W - c \, \dfrac{\partial \varepsilon}{\partial W}$
- Gradient descent learning rule (single TLU, incremental)
  - Gradient, by definition:
    $\dfrac{\partial \varepsilon}{\partial W} \overset{def}{=} \left( \dfrac{\partial \varepsilon}{\partial w_1}, \ldots, \dfrac{\partial \varepsilon}{\partial w_i}, \ldots, \dfrac{\partial \varepsilon}{\partial w_{n+1}} \right)$
    $\varepsilon = (d - f)^2 = (d - f_s(W \cdot X))^2, \qquad s = W \cdot X$
  - By the chain rule:
    $\dfrac{\partial \varepsilon}{\partial W} = \dfrac{\partial \varepsilon}{\partial s} \dfrac{\partial s}{\partial W}$, and since $s = W \cdot X$, $\dfrac{\partial s}{\partial W} = X$
    $\dfrac{\partial \varepsilon}{\partial s} = \dfrac{\partial (d - f_s)^2}{\partial s} = -2 (d - f_s) \dfrac{\partial f_s}{\partial s}$
    so $\dfrac{\partial \varepsilon}{\partial W} = -2 (d - f) \dfrac{\partial f_s}{\partial s} X$
  - Case 1) linear activation, $f_s(s) = s$: $\dfrac{\partial f_s}{\partial s} = 1$, hence
    $\dfrac{\partial \varepsilon}{\partial W} = -2 (d - f) X$
  - Case 2) sigmoid activation, $f_s(s) = \dfrac{1}{1 + e^{-s}}$: $\dfrac{\partial f_s}{\partial s} = f_s (1 - f_s)$, hence
    $\dfrac{\partial \varepsilon}{\partial W} = -2 (d - f) f (1 - f) X$
Training a Single TLU: Widrow-Hoff Procedure

- Activation function: $f_s(s) = s$
- Gradient: $\dfrac{\partial \varepsilon}{\partial W} = -2 (d - f) X$
- Update rule:
  $W' = W - \tfrac{1}{2} c \dfrac{\partial \varepsilon}{\partial W} = W + c (d - f) X$
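A minimal sketch of the Widrow-Hoff loop in Python; the learning rate, epoch count, and input-augmentation convention are illustrative assumptions.

```python
import numpy as np

def widrow_hoff(examples, n, c=0.1, epochs=50):
    """Widrow-Hoff (LMS) training of a single linear unit f(X) = W . X.
    `examples` is a list of (X, d) pairs where X is already augmented with
    a trailing 1, so the threshold is learned as the last weight."""
    W = np.zeros(n + 1)
    for _ in range(epochs):
        for X, d in examples:
            f = np.dot(W, X)         # linear activation: f_s(s) = s
            W = W + c * (d - f) * X  # W' = W + c (d - f) X
    return W

# Illustrative usage: fit d = 2*x - 1 from a few augmented examples.
data = [(np.array([x, 1.0]), 2 * x - 1) for x in (0.0, 0.5, 1.0, 1.5)]
print(widrow_hoff(data, n=1).round(2))  # approaches [ 2. -1.]
```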
Training a Single TLU: Generalized Delta Procedure

- Activation function (sigmoid): $f_s(s) = \dfrac{1}{1 + e^{-s}}$
- Gradient: $\dfrac{\partial \varepsilon}{\partial W} = -2 (d - f) f (1 - f) X$
- Update rule:
  $W' = W - \tfrac{1}{2} c \dfrac{\partial \varepsilon}{\partial W} = W + c (d - f) f (1 - f) X$
- Note the convenient sigmoid derivative: $f' = f (1 - f)$
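The loop differs from Widrow-Hoff only in the activation and the extra $f(1-f)$ factor; a sketch (hyperparameters again illustrative):

```python
import numpy as np

def generalized_delta(examples, n, c=0.5, epochs=500):
    """Same loop as Widrow-Hoff, but with a sigmoid unit, so the update
    carries the extra derivative factor f(1 - f)."""
    W = np.zeros(n + 1)
    for _ in range(epochs):
        for X, d in examples:
            f = 1.0 / (1.0 + np.exp(-np.dot(W, X)))  # sigmoid activation
            W = W + c * (d - f) * f * (1 - f) * X    # W' = W + c (d - f) f (1 - f) X
    return W
```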
Training a Single TLU: Error Correction Procedure

- Activation function: 0/1 threshold (step) function $f_s(s)$
- Learning rule
  - Adjust the weights only when $(d - f) = 1$ or $-1$, i.e., only on misclassified examples
  - Use the same weight update as the gradient procedures:
    $W' = W + c (d - f) X$
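A sketch of the error-correction procedure; the learning rate and epoch cap are assumptions.

```python
import numpy as np

def error_correction(examples, n, c=1.0, max_epochs=100):
    """Error-correction training of a 0/1 threshold unit: the weights change
    only on misclassified examples, where (d - f) is +1 or -1."""
    W = np.zeros(n + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for X, d in examples:
            f = 1 if np.dot(W, X) > 0 else 0  # 0/1 threshold activation
            if d != f:                        # adjust only when (d - f) = +/-1
                W = W + c * (d - f) * X       # W' = W + c (d - f) X
                mistakes += 1
        if mistakes == 0:                     # terminates iff linearly separable
            break
    return W
```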
Termination of the Learning Procedures

- Error correction procedure
  - If there exists some weight vector W that produces a correct output for all input vectors, the error-correction procedure will find such a weight vector and terminate.
  - If no such vector W exists, the error-correction procedure will never terminate.
- Widrow-Hoff and generalized delta procedures
  - Minimum squared-error solutions are found even when no perfect solution W exists.
- Example problem: show that the 0/1 situations are linearly separable. Homework!!
NEURAL NETWORKS

Neural Networks

- Why neural networks?
  - A single TLU is not enough!
  - There are sets of stimuli and responses that cannot be learned by a single TLU (non-linearly separable functions).
  - Let's use a network of TLUs!
Structure of Neural Networks

- Feedforward net: there is no circuit in the net; the output value depends only on the input values.
- Recurrent net: there are circuits in the net; the output value depends on the input and on the history.
An Example of a Neural Network

- A 3-layer feedforward network
  - Input layer
  - Hidden layer
  - Output layer
- It computes the XOR function: $f = x_1 \bar{x}_2 + \bar{x}_1 x_2$
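One hand-wired realization of this network in Python; the particular weights and thresholds are an illustrative choice (any weights implementing the two AND-NOT terms and the final OR would do).

```python
import numpy as np

def tlu(X, W, theta):
    """A single 0/1 threshold logic unit."""
    return 1 if np.dot(W, X) >= theta else 0

def xor_net(x1, x2):
    """3-layer feedforward net computing f = x1 x2' + x1' x2.
    Hidden unit h1 fires for (x1 AND NOT x2), h2 for (x2 AND NOT x1);
    the output unit ORs them."""
    h1 = tlu(np.array([x1, x2]), np.array([1.0, -1.0]), 0.5)   # x1 and not x2
    h2 = tlu(np.array([x1, x2]), np.array([-1.0, 1.0]), 0.5)   # x2 and not x1
    return tlu(np.array([h1, h2]), np.array([1.0, 1.0]), 0.5)  # h1 or h2

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))  # prints the XOR truth table: 0, 1, 1, 0
```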
Neural Network: Notations

- Output vector of the j-th layer: $X^{(j)}$
  - Input vector: $X^{(0)}$
  - Final-layer output vector: $X^{(k)} = f$
- Weight vector of the i-th TLU in the j-th layer:
  $W_i^{(j)} = [w_{l,i}^{(j)}], \quad l = 1, 2, \ldots, m_{j-1} + 1$
- Input to the i-th TLU in the j-th layer (its activation):
  $s_i^{(j)} = X^{(j-1)} \cdot W_i^{(j)}$
- Number of TLUs in the j-th layer: $m_j$

[Figure: a general k-layer feedforward network.]
Backpropagation Method (1/3)

- Gradient of the squared error with respect to the weight vector $W_i^{(j)}$:
  $\dfrac{\partial \varepsilon}{\partial W_i^{(j)}} \overset{def}{=} \left( \dfrac{\partial \varepsilon}{\partial w_{1,i}^{(j)}}, \ldots, \dfrac{\partial \varepsilon}{\partial w_{l,i}^{(j)}}, \ldots, \dfrac{\partial \varepsilon}{\partial w_{m_{j-1}+1,\,i}^{(j)}} \right)$
- Using the activation variable and the chain rule:
  $\dfrac{\partial \varepsilon}{\partial W_i^{(j)}} = \dfrac{\partial \varepsilon}{\partial s_i^{(j)}} \dfrac{\partial s_i^{(j)}}{\partial W_i^{(j)}}, \qquad \dfrac{\partial s_i^{(j)}}{\partial W_i^{(j)}} = X^{(j-1)}$
  $\dfrac{\partial \varepsilon}{\partial s_i^{(j)}} = \dfrac{\partial (d - f)^2}{\partial s_i^{(j)}} = -2 (d - f) \dfrac{\partial f}{\partial s_i^{(j)}}$
- Let the activation error influence be
  $\delta_i^{(j)} = (d - f) \dfrac{\partial f}{\partial s_i^{(j)}}$, so that $\dfrac{\partial \varepsilon}{\partial W_i^{(j)}} = -2\, \delta_i^{(j)} X^{(j-1)}$
- The weight update rule follows:
  $W_i^{(j)} \leftarrow W_i^{(j)} + c_i^{(j)}\, \delta_i^{(j)}\, X^{(j-1)}$
Backpropagation Method (2/3)

- Weight update rule in the final layer
  - By definition: $\delta^{(k)} = (d - f) \dfrac{\partial f}{\partial s^{(k)}}$
  - Since $f$ is the sigmoid function of $s^{(k)}$, $\dfrac{\partial f}{\partial s^{(k)}} = f (1 - f)$, hence $\delta^{(k)} = (d - f) f (1 - f)$
  - Backpropagation weight adjustment rule for the single TLU in the final layer:
    $W_i^{(k)} \leftarrow W_i^{(k)} + c_i^{(k)} (d - f) f (1 - f) X^{(k-1)}$
    (a special case of the general rule $W_i^{(j)} \leftarrow W_i^{(j)} + c_i^{(j)} \delta_i^{(j)} X^{(j-1)}$)
Backpropagation Method (3/3)

- Weight update rule in intermediate layers
  - Using the chain rule, the error influence at layer j expands over all $m_{j+1}$ TLUs of layer j+1:
    $\delta_i^{(j)} = (d - f) \dfrac{\partial f}{\partial s_i^{(j)}} = (d - f) \sum_{l=1}^{m_{j+1}} \dfrac{\partial f}{\partial s_l^{(j+1)}} \dfrac{\partial s_l^{(j+1)}}{\partial s_i^{(j)}} = \sum_{l=1}^{m_{j+1}} \delta_l^{(j+1)} \dfrac{\partial s_l^{(j+1)}}{\partial s_i^{(j)}}$
  - Since $s_l^{(j+1)} = \sum_{v} f_v^{(j)} w_{v,l}^{(j+1)}$, only the $v = i$ term depends on $s_i^{(j)}$:
    $\dfrac{\partial s_l^{(j+1)}}{\partial s_i^{(j)}} = w_{i,l}^{(j+1)} \dfrac{\partial f_i^{(j)}}{\partial s_i^{(j)}} = w_{i,l}^{(j+1)} f_i^{(j)} (1 - f_i^{(j)})$
  - Therefore:
    $\delta_i^{(j)} = f_i^{(j)} (1 - f_i^{(j)}) \sum_{l=1}^{m_{j+1}} \delta_l^{(j+1)} w_{i,l}^{(j+1)}$
  - And the update rule keeps the same form as before:
    $W_i^{(j)} \leftarrow W_i^{(j)} + c_i^{(j)}\, \delta_i^{(j)}\, X^{(j-1)}$
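Putting the three slides together: a minimal NumPy sketch that trains the earlier XOR network with sigmoid units. The layer sizes, seed, learning rate, and epoch count are illustrative assumptions, and with an unlucky seed the net can settle in a local minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

# XOR training data; a trailing constant 1 augments each input so that
# thresholds are learned as ordinary weights.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
D = np.array([0.0, 1.0, 1.0, 0.0])

W1 = rng.normal(0, 1, (3, 2))  # layer 1: two hidden sigmoid TLUs
W2 = rng.normal(0, 1, 3)       # layer 2: one output sigmoid TLU
c = 0.5                        # learning rate (illustrative)

for _ in range(10000):
    for x, d in zip(X, D):
        h = sigmoid(x @ W1)                       # forward: hidden outputs f_v^(1)
        h1 = np.append(h, 1.0)                    # augment with the constant input
        f = sigmoid(h1 @ W2)                      # forward: final output f
        delta2 = (d - f) * f * (1 - f)            # delta^(k) = (d - f) f (1 - f)
        delta1 = h * (1 - h) * (delta2 * W2[:2])  # delta_i^(j) = f_i(1-f_i) sum_l delta_l w_il
        W2 += c * delta2 * h1                     # W <- W + c delta X
        W1 += c * np.outer(x, delta1)

print([round(float(sigmoid(np.append(sigmoid(x @ W1), 1.0) @ W2)), 2) for x in X])
```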
Recurrent Neural Networks: Hopfield Network

- Proper when exact binary representations are possible.
- Can be used as an associative memory or to solve optimization problems.
- The number of classes M must be kept smaller than 0.15 times the number of nodes N:
  $M \le 0.15\,N$ (e.g., $N = 100 \Rightarrow M \le 15$)

[Figure: fully interconnected net; inputs $x_0, \ldots, x_{N-1}$ applied at time zero, outputs $x'_0, \ldots, x'_{N-1}$ valid after convergence.]
Recurrent Neural Networks: Hopfield Network Algorithm (1/2)

- Step 1: Assign connection weights.
  $T_{ij} = \begin{cases} \sum_{s=0}^{M-1} x_i^s x_j^s & i \ne j \\ 0 & i = j \end{cases}$
  - $T_{ij}$: connection weight from node i to node j
  - $x_i^s \in \{+1, -1\}$: i-th element of the exemplar of class s
- Step 2: Initialize with the unknown input pattern.
  $m_i(0) = x_i, \quad 0 \le i \le N - 1$
  - $m_i(t)$: output of node i at time t
  - $x_i$: i-th element of the input pattern
Recurrent Neural Networks: Hopfield Network Algorithm (2/2)

- Step 3: Iterate until convergence.
  $m_i(t + 1) = F_h\left( \sum_{j=0}^{N-1} T_{ij}\, m_j(t) \right)$
  - $F_h$: hard limiter (outputs +1 or -1 according to the sign of its input)
- Step 4: Go to step 2 (process the next unknown pattern).
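A compact sketch of steps 1-3 in Python; for brevity it updates all nodes synchronously, whereas the classical Hopfield net updates them asynchronously.

```python
import numpy as np

def hopfield_weights(exemplars):
    """Step 1: T_ij = sum over classes s of x_i^s x_j^s, with a zero diagonal.
    `exemplars` is an (M, N) array of +1/-1 patterns."""
    T = exemplars.T @ exemplars
    np.fill_diagonal(T, 0)
    return T

def hopfield_recall(T, x, max_steps=100):
    """Steps 2-3: start from the unknown +1/-1 pattern x and iterate
    m(t+1) = F_h(T m(t)) until the state stops changing."""
    m = x.copy()
    for _ in range(max_steps):
        m_next = np.where(T @ m >= 0, 1, -1)  # hard limiter F_h
        if np.array_equal(m_next, m):         # converged to a stored attractor
            return m
        m = m_next
    return m

# Illustrative usage: store two 8-bit patterns, then recall from a noisy probe.
P = np.array([[1, 1, 1, 1, -1, -1, -1, -1],
              [1, -1, 1, -1, 1, -1, 1, -1]])
T = hopfield_weights(P)
probe = np.array([1, 1, 1, -1, -1, -1, -1, -1])  # pattern 0 with one flipped bit
print(hopfield_recall(T, probe))                 # recovers the first stored pattern
```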
Recurrent Neural Networks: Hamming Net

- Optimum minimum-error classifier
  - Calculates the Hamming distance to the exemplar of each class and selects the class with the minimum Hamming distance
  - Hamming distance = number of differing bits
- Advantages over the Hopfield net:
  - The Hopfield net performs worse than or equivalently to the Hamming net
  - The Hamming net requires fewer connections; its connection count grows linearly in N:
    $N^2$ (Hopfield) vs. $NM + M^2 = M(N + M) \approx NM$ (Hamming) when $N \gg M$
    e.g., N = 100, M = 10: 10000 vs. 1100 ≈ 1000

[Figure: a lower subnet with weights $w_{ij}$ calculates matching scores from the inputs $x_0, \ldots, x_{N-1}$ (data, applied at time zero); an upper MAXNET with weights $t_{kl}$ picks the maximum, yielding class outputs $y_0, \ldots, y_{M-1}$ (valid after the MAXNET converges).]
Recurrent Neural Networks: Hamming Net Algorithm

- Step 1: Assign connection weights and offsets.
  - In the lower subnet:
    $w_{ij} = \dfrac{x_i^j}{2}, \quad \theta_j = \dfrac{N}{2}, \quad 0 \le i \le N - 1,\ 0 \le j \le M - 1$
  - In the upper subnet (for lateral inhibition):
    $t_{kl} = \begin{cases} 1 & k = l \\ -\epsilon & k \ne l \end{cases}, \quad \epsilon < \dfrac{1}{M}, \quad 0 \le k, l \le M - 1$
  - $w_{ij}$: connection weight from input i to node j in the lower subnet
  - $t_{kl}$: connection weight from node k to node l in the upper subnet
- Step 2: Initialize with the unknown input pattern.
  $\mu_j(0) = f_t\left( \sum_{i=0}^{N-1} w_{ij} x_i + \theta_j \right), \quad 0 \le j \le M - 1$
  - $\mu_j(t)$: output of node j in the upper subnet at time t
  - $x_i$: i-th element of the input
- Step 3: Iterate until convergence.
  $\mu_j(t + 1) = f_t\left( \mu_j(t) - \epsilon \sum_{k \ne j} \mu_k(t) \right), \quad 0 \le j, k \le M - 1$
  This process is repeated until convergence.
- Step 4: Go to step 2.
- $f_t$: threshold-logic nonlinearity (zero for negative inputs, linear above)
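A sketch of the whole algorithm in Python; the choice eps = 1/(M+1) and the simple non-saturating f_t are illustrative assumptions.

```python
import numpy as np

def hamming_classify(exemplars, x, eps=None, max_steps=1000):
    """Hamming net sketch. The lower subnet computes matching scores
    (N minus the Hamming distance to each exemplar); the MAXNET then
    iterates lateral inhibition until one node remains positive.
    `exemplars` is an (M, N) array of +1/-1 patterns, x a +1/-1 vector."""
    M, N = exemplars.shape
    eps = eps if eps is not None else 1.0 / (M + 1)  # must satisfy eps < 1/M
    ft = lambda s: np.maximum(s, 0.0)                # threshold-logic nonlinearity
    mu = ft(exemplars @ x / 2.0 + N / 2.0)           # step 2: matching scores N - HD
    for _ in range(max_steps):                       # step 3: MAXNET iteration
        mu = ft(mu - eps * (mu.sum() - mu))          # inhibit each node by all others
        if np.count_nonzero(mu > 0) <= 1:            # converged: a single winner left
            break
    return int(np.argmax(mu))                        # index of the winning class
```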
Self-Organizing Feature Map

- Transforms input patterns of arbitrary dimension into a discrete map of lower dimension
  - Clustering
  - Representation by a representative
- Algorithm (a code sketch follows below)
  1. Initialize the weights w
  2. Find the nearest cell: $i(x) = \arg\min_j \| x(n) - w_j(n) \|$
  3. Update the weights of i(x) and its neighbors: $w_j(n+1) = w_j(n) + \eta(n)\,[\,x(n) - w_j(n)\,]$
  4. Reduce the neighborhood NE(n) and the learning rate $\eta$
  5. Go to 2

[Figure: 2-D grid of cells with inputs $x_1, x_2$; the neighborhood NE(n) around the winning cell j shrinks to NE(n+1) over time.]
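A minimal sketch of steps 1-5 in Python; the linear decay schedules for the learning rate and NE(n), and the square neighborhood, are illustrative choices.

```python
import numpy as np

def sofm(samples, grid=10, steps=10000, eta0=0.5, ne0=5.0):
    """SOFM sketch for 2-D inputs on a grid x grid map. Both the learning
    rate eta and the neighborhood radius shrink over time; the schedules
    below are illustrative, not prescribed by the slides."""
    rng = np.random.default_rng(0)
    W = rng.random((grid, grid, 2))  # step 1: random weights in [0, 1)
    coords = np.dstack(np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij"))
    for n in range(steps):
        x = samples[rng.integers(len(samples))]
        frac = n / steps
        eta, ne = eta0 * (1 - frac), ne0 * (1 - frac) + 1  # step 4: decay eta and NE(n)
        d = np.linalg.norm(W - x, axis=2)                  # step 2: nearest cell
        win = np.unravel_index(np.argmin(d), d.shape)
        mask = np.linalg.norm(coords - np.array(win), axis=2) <= ne
        W[mask] += eta * (x - W[mask])                     # step 3: update winner + neighbors
    return W

# Example matching the next slide: samples from the 2-D unit square, 10x10 map.
samples = np.random.default_rng(1).random((1000, 2))
W = sofm(samples)
```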
SOFM Example

- Input samples: random numbers within the 2-D unit square
- 100 neurons (10x10)
- Initial weights: random assignment (0.0~1.0)
- Display
  - Each neuron is positioned at $(w_1, w_2)$
  - Neighbors are connected by lines

[Figure: the input samples, the initial weights, and the map after 25, 1000, and 10000 iterations.]

Programming Assignment!!