INTRODUCTION TO NEURAL NETWORKS AND LEARNING
What is a Neural Network?
A computational model inspired by biological neurons, the cells that perform information processing in the brain.
Such systems are trained, rather than programmed, to accomplish tasks.
They have been successful at solving problems that proved difficult or impossible for conventional computing techniques.
* Formal definition: A neural network is a non-programmed, adaptive information-processing system based on research into how the brain encodes and processes information.
Biological Neuron
(Figure excerpted from Artificial Intelligence: A Modern Approach by S. Russell and P. Norvig.)
Properties of Biological Neurons and the Brain
Neurotransmission
  - by means of electrical impulses effected by chemical transmitters
Period of latent summation
  - an impulse is generated (firing) if the total membrane potential reaches a threshold level
  - synapses are excitatory or inhibitory
Human Brain
  - Cerebral cortex: areas of the brain specialized for specific functions
  - ~10^11 neurons
  - ~10^4 synapses per neuron
  - ~10^-3 sec cycle time (computer: 10^-8 sec)
Computational Model of a Neuron
Figure: dendrites (input) feed the soma through synapses with weights w_i; a summer computes the activation

  s(X) = sum_{i=1}^{N} w_i x_i

which passes through a threshold function f_s(s) (output 0 or 1) and is sent out along the axon (output).
What Makes Up a Neural Network?
  Neural Net                Neurobiology
  Processing element        Neuron
  Interconnection scheme    Dendrites and axons
  Learning law              Neurotransmitters
Neural Net vs. Von Neumann Computer
  Neural Net                                        | Von Neumann
  Non-algorithmic                                   | Algorithmic
  Trained                                           | Programmed with instructions
  Memory and processing elements are the same       | Memory and processing are separate
  Pursues multiple hypotheses simultaneously        | Pursues one hypothesis at a time
  Fault tolerant                                    | Not fault tolerant
  Non-logical operation                             | Highly logical operation
  Adaptation or learning                            | Modification of algorithmic parameters only
  Seeks answers by finding minima in solution space | Seeks answers by following a logical tree structure
What is Learning?
Concepts of inductive learning (learning from examples)
  - Example: a pair (x, f(x))
  - Inductive inference: from a collection of examples (x, f(x)), produce a hypothesis h(x)
  - Hypothesis h: an approximation of f, the agent's belief about f (h ≈ f)
  - Hypothesis space: the set of all expressible hypotheses
Bias
  - Any preference for one hypothesis over another.
  - Needed because there are many hypotheses consistent with the examples.
Examples of biased hypotheses
Figure: (a) a set of examples; (b)-(d) three different hypotheses, each consistent with the same examples.
Representation of functions
  - Expressiveness: a single perceptron cannot learn XOR
  - Efficiency: the number of examples needed for good generalization ('a good set of sentences')
Learning Procedure
1. Collect a large set of examples.
2. Divide it into two disjoint sets: a training set and a test set.
3. Run the learning algorithm on the training set to generate a hypothesis H.
4. Measure the percentage of examples in the test set that are correctly classified by H.
5. Repeat steps 1-4 for different sizes of training sets and different randomly selected training sets of each size.
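The five steps above can be sketched as a small evaluation harness. This is a minimal sketch, not code from the lecture: a "learner" is assumed to be any function mapping a training set to a hypothesis, and `majority_learner` and the threshold-at-50 toy dataset are purely illustrative.

```python
import random

def evaluate(learn, examples, train_fraction=0.7, trials=5):
    """Steps 1-5: split into disjoint train/test sets, learn H, score H on the test set, repeat."""
    n_train = int(len(examples) * train_fraction)
    accuracies = []
    for _ in range(trials):
        shuffled = random.sample(examples, len(examples))   # a fresh random split each trial
        train, test = shuffled[:n_train], shuffled[n_train:]
        h = learn(train)                                    # step 3: generate hypothesis H
        correct = sum(1 for x, fx in test if h(x) == fx)    # step 4: test-set accuracy
        accuracies.append(correct / len(test))
    return sum(accuracies) / trials

# A trivial illustrative learner: always predict the majority class of the training set.
def majority_learner(train):
    guess = 1 if 2 * sum(fx for _, fx in train) >= len(train) else 0
    return lambda x: guess

data = [((i,), 1 if i > 50 else 0) for i in range(100)]     # assumed toy dataset
acc = evaluate(majority_learner, data)
```

Any of the TLU training procedures introduced below could be plugged in as `learn`.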
SINGLE THRESHOLD LOGIC UNIT (TLU)
Introduction: S-R Learning with a TLU/NN
Experiences: a set of sensory data, each paired with its proper action
  E = {(X, a) | X = (x_1, ..., x_i, ..., x_n), a ∈ {0, 1}}
Knowledge: a function from sensory data to the proper action
  f : X → a
Representation:
  1. a single TLU with adjustable weights
  2. a multilayer perceptron
Training a Single TLU
TLU definition (internal knowledge representation)
  - an abstract computational tool that calculates f(X) = f_s(W · X)
  - input: X
  - output: f(X)
  - transfer function: f_s(s), a 0/1 threshold on s
  - weights: W
  - threshold: θ
Geometric Interpretation of TLU Computation
The TLU outputs 0 if X lies on one side of a hyperplane and 1 if it lies on the other side.
The hyperplane in R^n:
  W · X − θ = 0
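As a concrete illustration of the definition, a TLU that fires when the weighted sum reaches the threshold takes only a few lines. This is a sketch; the AND weights and threshold are an illustrative choice, not taken from the slides.

```python
def tlu(weights, threshold, x):
    """Fire 1 iff the weighted sum of the inputs reaches the threshold."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s >= threshold else 0

# The hyperplane x1 + x2 = 1.5 separates (1,1) from the other three corners
# of the unit square, so this TLU computes AND.
and_outputs = [tlu([1, 1], 1.5, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```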
Training a Single TLU: Gradient Descent Method
Definitions
  s = W · X = (w_1, w_2, ..., w_n) · (x_1, x_2, ..., x_n)
  Absorbing the threshold: W' · X' = (w_1, w_2, ..., w_n, −θ) · (x_1, x_2, ..., x_n, 1)
Gradient Descent Methods
Learning is a search over internal representations that minimizes some evaluation function:
  ε = Σ_i (d_i − f_i)^2
where f_i = f_s(W · X_i) is the TLU output and d_i is the desired output for input X_i.
ε can be minimized by gradient descent, a greedy method.
How to update W?
  - Incremental learning: adjust W to slightly reduce ε for one X_i
  - Batch learning: adjust W to reduce ε for all X_i
  W ← W − c ∂ε/∂W
– Gradient descent learning rule (single TLU, incremental)
Gradient, by definition:
  ∂ε/∂W = (∂ε/∂w_1, ..., ∂ε/∂w_i, ..., ∂ε/∂w_{n+1})
With ε = (d − f)^2 = (d − f_s(W · X))^2 and s = W · X,
by the chain rule:
  ∂ε/∂W = (∂ε/∂s)(∂s/∂W),  where ∂s/∂W = ∂(W · X)/∂W = X
  ∂ε/∂s = ∂(d − f_s)^2/∂s = −2(d − f_s) ∂f_s/∂s
so
  ∂ε/∂W = −2(d − f)(∂f/∂s) X
Case 1) f_s(s) = s:
  ∂f_s/∂s = 1,  so ∂ε/∂W = −2(d − f) X
Case 2) f_s(s) = 1/(1 + e^{−s}) (sigmoid):
  ∂f_s/∂s = f_s(1 − f_s),  so ∂ε/∂W = −2(d − f) f(1 − f) X
Training a Single TLU: Widrow-Hoff Procedure
Activation function: f_s(s) = s
  ∂ε/∂W = −2(d − f) X
  W' = W − (1/2) c ∂ε/∂W
  W' = W + c(d − f) X
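The Widrow-Hoff rule W' = W + c(d − f)X with linear activation f = W · X can be sketched as follows. This is an illustrative sketch: the identity-function training data, the rate c = 0.1, and the epoch count are assumptions, not values from the lecture.

```python
def widrow_hoff(examples, n, c=0.1, epochs=100):
    """Incremental Widrow-Hoff: linear activation f = W'.X',
    update W' <- W' + c (d - f) X'.  The augmented last weight plays the role of -theta."""
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, d in examples:
            xa = list(x) + [1.0]                          # augmented input X'
            f = sum(wi * xi for wi, xi in zip(w, xa))     # linear activation
            w = [wi + c * (d - f) * xi for wi, xi in zip(w, xa)]
    return w

# Illustrative data: learn d = x from two points; the minimum-error
# solution is weight ~1 and bias ~0.
w = widrow_hoff([((0.0,), 0.0), ((1.0,), 1.0)], n=1)
pred = w[0] * 1.0 + w[1]
```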
Training a Single TLU: Generalized Delta Procedure
Activation function: the sigmoid
  f_s(s) = 1/(1 + e^{−s})
  ∂ε/∂W = −2(d − f) f(1 − f) X
  W' = W − (1/2) c ∂ε/∂W
  W' = W + c(d − f) f(1 − f) X
using ∂f/∂s = f(1 − f).
Training a Single TLU: Error-Correction Procedure
Activation function: 0/1 threshold function
Learning rule:
  - Adjust the weights only when (d − f) = 1 or −1, i.e., only when the output is wrong.
  - Use the same weight-update form:
  W' = W + c(d − f) X
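The error-correction procedure can be sketched as a perceptron-style training loop. The AND training data and c = 1 are illustrative assumptions; because AND is linearly separable, the loop terminates with all examples classified correctly, matching the termination claim below.

```python
def error_correction(examples, n, c=1.0, max_epochs=100):
    """0/1 threshold TLU; weights change only when the output is wrong,
    i.e. when (d - f) is +1 or -1: W' = W + c (d - f) X."""
    w = [0.0] * (n + 1)                      # augmented weights; last one plays -theta
    for _ in range(max_epochs):
        mistakes = 0
        for x, d in examples:
            xa = list(x) + [1.0]
            f = 1 if sum(wi * xi for wi, xi in zip(w, xa)) >= 0 else 0
            if f != d:
                w = [wi + c * (d - f) * xi for wi, xi in zip(w, xa)]
                mistakes += 1
        if mistakes == 0:                    # termination: all inputs correct
            break
    return w

# AND is linearly separable, so the procedure finds a separating W (illustrative data).
and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = error_correction(and_examples, n=2)
preds = [1 if sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) >= 0 else 0
         for x, _ in and_examples]
```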
Termination of the Learning Procedures
Error-correction procedure
  - If there exists some weight vector W that produces a correct output for all input vectors, the error-correction procedure will find such a weight vector and terminate.
  - If no such W exists, the error-correction procedure will never terminate.
Widrow-Hoff and generalized delta procedures
  - Minimum squared-error solutions are found even when no perfect solution W exists.
– Example problem: 0/1 situations are linearly separable! Homework!!
NEURAL NETWORKS
Why Neural Networks?
A single TLU is not enough! There are sets of stimuli and responses that cannot be learned by a single TLU (non-linearly-separable functions).
So let's use a network of TLUs!
Structure of Neural Networks
  - Feedforward net: no circuits in the net; the output value depends only on the input values.
  - Recurrent net: there are circuits in the net; the output value depends on the input and its history.
An Example of a Neural Network
A 3-layer feedforward network (input layer, hidden layer, output layer) computing
  f = x_1 x̄_2 + x̄_1 x_2   (XOR)
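To make the point concrete, here is a hand-wired network of three TLUs computing f = x_1 x̄_2 + x̄_1 x_2. The particular weights and thresholds are an illustrative choice, not values read off the figure.

```python
def tlu(weights, threshold, x):
    """A single threshold logic unit: fire 1 iff the weighted sum reaches the threshold."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s >= threshold else 0

def xor_net(x1, x2):
    """Hidden TLUs compute (x1 AND NOT x2) and (NOT x1 AND x2); the output TLU ORs them."""
    h1 = tlu([1, -1], 0.5, (x1, x2))    # x1 and not x2
    h2 = tlu([-1, 1], 0.5, (x1, x2))    # not x1 and x2
    return tlu([1, 1], 0.5, (h1, h2))   # h1 or h2

truth_table = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

No single TLU can produce this truth table, but two layers suffice.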
Neural Network: Notations (a general k-layer feedforward network)
  X^(j)  : output vector of the j-th layer
  X^(0)  : input vector
  X^(k) = f : final-layer output vector
  W_i^(j) = [w_{l,i}^(j)], l = 1, 2, ..., m_{j−1}+1 : weight vector of the i-th TLU in the j-th layer
  s_i^(j) = X^(j−1) · W_i^(j) : input to the i-th TLU in the j-th layer (activation)
  m_j : number of TLUs in the j-th layer
Backpropagation Method (1/3)
Gradient of the squared error with respect to the weight vector W_i^(j):
  ∂ε/∂W_i^(j) = (∂ε/∂w_{1,i}^(j), ..., ∂ε/∂w_{l,i}^(j), ..., ∂ε/∂w_{m_{j−1}+1,i}^(j))
Using the activation variable and the chain rule:
  ∂ε/∂W_i^(j) = (∂ε/∂s_i^(j)) (∂s_i^(j)/∂W_i^(j)),  where ∂s_i^(j)/∂W_i^(j) = X^(j−1)
  ∂ε/∂s_i^(j) = ∂(d − f)^2/∂s_i^(j) = −2(d − f) ∂f/∂s_i^(j)
Let the activation error influence be
  δ_i^(j) = (d − f) ∂f/∂s_i^(j)
so that
  ∂ε/∂W_i^(j) = −2 δ_i^(j) X^(j−1)
The weight update rule is derived:
  W_i^(j) ← W_i^(j) + c_i^(j) δ_i^(j) X^(j−1)
Backpropagation Method (2/3)
Weight update rule in the final layer
By definition,
  δ^(k) = (d − f) ∂f/∂s^(k)
Since f is the sigmoid function of s^(k), ∂f/∂s^(k) = f(1 − f), so
  δ^(k) = (d − f) f(1 − f)
Backpropagation weight adjustment rule for the single TLU in the final layer (an instance of the general rule W_i^(j) ← W_i^(j) + c_i^(j) δ_i^(j) X^(j−1)):
  W_i^(k) ← W_i^(k) + c_i^(k) (d − f) f(1 − f) X^(k−1)
Backpropagation Method (3/3)
Weight update rule in intermediate layers
Using the chain rule, the error influence at layer j expands over the activations of layer j+1:
  δ_i^(j) = (d − f) ∂f/∂s_i^(j)
          = (d − f) Σ_{l=1}^{m_{j+1}} (∂f/∂s_l^(j+1)) (∂s_l^(j+1)/∂s_i^(j))
          = Σ_{l=1}^{m_{j+1}} δ_l^(j+1) (∂s_l^(j+1)/∂s_i^(j))
Since s_l^(j+1) = Σ_{v} f_v^(j) w_{v,l}^(j+1),
  ∂s_l^(j+1)/∂s_i^(j) = w_{i,l}^(j+1) ∂f_i^(j)/∂s_i^(j) = w_{i,l}^(j+1) f_i^(j)(1 − f_i^(j))
Therefore
  δ_i^(j) = f_i^(j)(1 − f_i^(j)) Σ_{l=1}^{m_{j+1}} δ_l^(j+1) w_{i,l}^(j+1)
and the update rule is, as before,
  W_i^(j) ← W_i^(j) + c_i^(j) δ_i^(j) X^(j−1)
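The three parts above (forward pass, δ^(k) at the output, δ^(j) propagated backward, and the weight updates) can be sketched end-to-end on the XOR task. This is a minimal sketch: the 2-2-1 topology, random seed, learning rate c = 0.5, and epoch count are illustrative assumptions.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

random.seed(0)
# 2-2-1 topology; augmented inputs carry each threshold as an extra weight (assumed setup)
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden TLUs
W2 = [random.uniform(-1, 1) for _ in range(3)]                      # output TLU

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]         # XOR

def forward(x):
    xa = list(x) + [1.0]                                  # X^(0), augmented
    h = [sigmoid(sum(w * v for w, v in zip(Wi, xa))) for Wi in W1]
    ha = h + [1.0]                                        # X^(1), augmented
    f = sigmoid(sum(w * v for w, v in zip(W2, ha)))       # final output
    return xa, ha, f

def sq_error():
    return sum((d - forward(x)[2]) ** 2 for x, d in data)

e0 = sq_error()
c = 0.5                                                   # learning rate (assumed)
for _ in range(10000):
    for x, d in data:
        xa, ha, f = forward(x)
        delta_out = (d - f) * f * (1 - f)                 # delta^(k) = (d - f) f(1 - f)
        # delta_i^(j) = f_i (1 - f_i) * sum_l delta_l^(j+1) w_{i,l}^(j+1)
        delta_h = [ha[i] * (1 - ha[i]) * delta_out * W2[i] for i in range(2)]
        W2 = [w + c * delta_out * v for w, v in zip(W2, ha)]
        for i in range(2):
            W1[i] = [w + c * delta_h[i] * v for w, v in zip(W1[i], xa)]
e1 = sq_error()
```

After training, the squared error over the four XOR patterns is far below its initial value, which a single TLU could never achieve.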
Recurrent Neural Networks: Hopfield Network
  - Appropriate when exact binary representations are possible.
  - Can be used as an associative memory or to solve optimization problems.
  - The number of classes M must be kept smaller than 0.15 times the number of nodes N:
      M ≤ 0.15 N   (e.g., N = 100 → M ≤ 15)
Figure: a fully interconnected net; inputs x_0, ..., x_{N−1} applied at time zero, outputs x'_0, ..., x'_{N−1} valid after convergence.
Recurrent Neural Networks: Hopfield Network Algorithm (1/2)
Step 1: Assign connection weights.
  T_ij = Σ_{s=0}^{M−1} x_i^s x_j^s   for i ≠ j;   T_ii = 0
  T_ij : connection weight from node i to node j
  x_i^s ∈ {+1, −1} : i-th element of the exemplar for class s
Step 2: Initialize with the unknown input pattern.
  m_i(0) = x_i,   0 ≤ i ≤ N−1
  m_i(t) : output of node i at time t
  x_i : i-th element of the input pattern
Recurrent Neural Networks: Hopfield Network Algorithm (2/2)
Step 3: Iterate until convergence.
  m_i(t+1) = F_h( Σ_{j=0}^{N−1} T_ij m_j(t) )
  F_h : hard limiter (outputs +1 or −1)
Step 4: Go to step 2 for the next input pattern.
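The four steps can be sketched directly. This is an illustrative sketch: the two stored patterns are made up, and the update here is synchronous (all nodes at once), one common simplification of the iteration in step 3.

```python
def hopfield_weights(patterns):
    """Step 1: T[i][j] = sum over classes s of x_i^s * x_j^s, with zero diagonal."""
    N = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns) for j in range(N)]
            for i in range(N)]

def hopfield_recall(T, x, steps=10):
    """Steps 2-3: initialize with the unknown pattern, iterate the hard limiter."""
    m = list(x)
    for _ in range(steps):
        m = [1 if sum(T[i][j] * m[j] for j in range(len(m))) >= 0 else -1
             for i in range(len(m))]
    return m

# Two made-up exemplars over N = 6 nodes (M = 2 < 0.15 * 6 does not hold exactly;
# this tiny size is for illustration only).
stored = [[1, 1, -1, -1, 1, -1], [-1, 1, 1, -1, -1, 1]]
T = hopfield_weights(stored)
noisy = [1, 1, -1, -1, 1, 1]        # stored[0] with its last bit flipped
recalled = hopfield_recall(T, noisy)
```

Acting as an associative memory, the net pulls the noisy input back to the nearest stored pattern.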
Recurrent Neural Networks: Hamming Net
  - Optimum minimum-error classifier
  - Calculates the Hamming distance to the exemplar of each class and selects the class with minimum Hamming distance
  - Hamming distance = number of differing bits
Advantages over the Hopfield net:
  - The Hopfield net is worse than or at best equivalent to the Hamming net
  - The Hamming net requires fewer connections; its connection count grows linearly in N:
      Hopfield: N^2   vs.   Hamming: NM + M^2 = M(N + M) ≈ NM when N >> M
      (e.g., N = 100, M = 10: 10000 vs. 1100)
Figure: inputs x_0, ..., x_{N−1} (data, applied at time zero) feed a lower subnet (weights W_ij) that calculates matching scores; an upper MAXNET subnet (weights T_kl) picks the maximum, yielding outputs Y_0, ..., Y_{M−1} (class, valid after MAXNET converges).
Recurrent Neural Networks: Hamming Net Algorithm
Step 1: Assign connection weights and offsets.
  In the lower subnet:
    w_ij = x_i^j / 2,  θ_j = N/2,   0 ≤ i ≤ N−1, 0 ≤ j ≤ M−1
  In the upper subnet (for lateral inhibition):
    t_kl = 1 if k = l;   t_kl = −ε if k ≠ l, with ε < 1/M;   0 ≤ k, l ≤ M−1
  w_ij : connection weight from input i to node j in the lower subnet
  t_kl : connection weight from node k to node l in the upper subnet
Step 2: Initialize with the unknown input pattern.
  μ_j(0) = f_t( Σ_{i=0}^{N−1} w_ij x_i + θ_j ),   0 ≤ j ≤ M−1
  μ_j(t) : output of node j in the upper subnet at time t
  x_i : i-th element of the input
Step 3: Iterate until convergence.
  μ_j(t+1) = f_t( μ_j(t) − ε Σ_{k≠j} μ_k(t) ),   0 ≤ j, k ≤ M−1
Step 4: Go to step 2 for the next input pattern.
  f_t : threshold-logic (ramp) nonlinearity
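Steps 1-3 can be sketched in one function. This is an illustrative sketch: the exemplars are made up, ε is set to 1/(2M) (any value below 1/M works), and the lower-subnet score Σ_i (x_i^j / 2) x_i + N/2 equals N minus the Hamming distance, which is what MAXNET then maximizes.

```python
def hamming_classify(exemplars, x, eps=None, max_iter=100):
    """Lower subnet: matching score per class; upper subnet (MAXNET): lateral
    inhibition until the class with the maximum score remains."""
    N, M = len(x), len(exemplars)
    eps = eps if eps is not None else 1.0 / (2 * M)      # must satisfy eps < 1/M
    # Step 2: score_j = sum_i (x_i^j / 2) x_i + N/2 = N - HammingDistance(exemplar_j, x)
    mu = [sum(e[i] * x[i] for i in range(N)) / 2.0 + N / 2.0 for e in exemplars]
    # Step 3: MAXNET iteration with a ramp (clip negatives to zero)
    for _ in range(max_iter):
        new = [max(0.0, mu[j] - eps * sum(mu[k] for k in range(M) if k != j))
               for j in range(M)]
        if new == mu:
            break
        mu = new
    return max(range(M), key=lambda j: mu[j])

# Made-up bipolar exemplars over N = 6 bits, M = 2 classes.
exemplars = [[1, 1, 1, -1, -1, -1], [-1, -1, 1, 1, 1, -1]]
probe = [1, 1, -1, -1, -1, -1]      # Hamming distance 1 to exemplar 0, 5 to exemplar 1
label = hamming_classify(exemplars, probe)
```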
Self-Organizing Feature Map (SOFM)
Transforms input patterns of arbitrary dimension into a discrete map of lower dimension
  - clustering
  - representation by representatives
Algorithm
  1. Initialize the weights w_j.
  2. Find the nearest cell: i(x) = argmin_j || x(n) − w_j(n) ||
  3. Update the weights of i(x) and its neighbors NE(n):
       w_j(n+1) = w_j(n) + η(n) [ x(n) − w_j(n) ]
  4. Reduce the neighborhood NE(n) and the learning rate η(n).
  5. Go to step 2.
Figure: a 2-D grid of cells with inputs x_1, x_2; the neighborhood NE(n) around winner j shrinks to NE(n+1) over time.
SOFM Example
  - Input samples: random points in the 2-D unit square
  - 100 neurons (10 × 10)
  - Initial weights: random assignment in (0.0, 1.0)
  - Display: each neuron is positioned at (w_1, w_2); neighboring neurons are connected by lines
Figures: input samples; initial weights; the map after 25, 1000, and 10000 iterations.
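The unit-square example above can be sketched as follows. This is an illustrative sketch: the linear schedules for η(n) and the square neighborhood NE(n), the seed, and the (smaller, 5 × 5) grid size are assumptions chosen to keep the run short.

```python
import random

def train_sofm(grid=10, iters=10000, eta0=0.5):
    """SOFM on random points in the 2-D unit square: find the nearest cell,
    pull it and its neighborhood NE(n) toward the sample, shrink eta and NE."""
    random.seed(0)
    ne0 = grid // 2
    # Step 1: initialize weights randomly in the unit square
    w = [[[random.random(), random.random()] for _ in range(grid)] for _ in range(grid)]
    for n in range(iters):
        x = [random.random(), random.random()]
        # Step 2: i(x) = argmin_j || x(n) - w_j(n) ||
        bi, bj = min(((i, j) for i in range(grid) for j in range(grid)),
                     key=lambda ij: (x[0] - w[ij[0]][ij[1]][0]) ** 2 +
                                    (x[1] - w[ij[0]][ij[1]][1]) ** 2)
        # Steps 3-4: update winner and neighbors; eta and NE shrink linearly over time
        frac = 1.0 - n / iters
        eta = eta0 * frac
        ne = int(ne0 * frac)
        for i in range(max(0, bi - ne), min(grid, bi + ne + 1)):
            for j in range(max(0, bj - ne), min(grid, bj + ne + 1)):
                w[i][j][0] += eta * (x[0] - w[i][j][0])
                w[i][j][1] += eta * (x[1] - w[i][j][1])
    return w

w = train_sofm(grid=5, iters=2000)
```

Plotting each neuron at (w_1, w_2) with lines between grid neighbors reproduces the unfolding pictured in the figures above.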
Programming Assignment!!