Introduction to Neural Networks
What are connectionist neural networks?
• Connectionism refers to a computer modeling approach to computation that is loosely based upon the architecture of the brain
• Many different models, but all include:
  – Multiple, individual "nodes" or "units" that operate at the same time (in parallel)
  – A network that connects the nodes together
  – Information is stored in a distributed fashion among the links that connect the nodes
  – Learning can occur with gradual changes in connection strength
History of Neural Networks (1)
• Attempts to mimic the human brain date back to work in the 1930s, 1940s, and 1950s by Alan Turing, Warren McCulloch, Walter Pitts, Donald Hebb, and John von Neumann
  – 1943 McCulloch-Pitts: neuron as computing element
  – 1948 Wiener: cybernetics
  – 1949 Hebb: learning rule
• 1957 Rosenblatt at Cornell developed the Perceptron, a hardware neural net for character recognition
• 1959 Widrow and Hoff at Stanford developed Adaline for adaptive control of noise on telephone lines
  – 1960 Widrow-Hoff: least mean square algorithm
History of Neural Networks (2)
• Recession
  – 1969 Minsky-Papert: limitations of the perceptron model (linear separability in perceptrons)
History of Neural Networks (3)
• Revival: mathematically tied together many of the ideas from previous research
  – 1982 Hopfield: recurrent network model
  – 1982 Kohonen: self-organizing maps
  – 1986 Rumelhart et al.: backpropagation
  – Universal approximation results
• Since then, growth has exploded. Over 80% of the "Fortune 500" have neural net R&D programs
  – Thousands of research papers
  – Commercial software applications
Applications of Neural Networks
• Forecasting/market prediction: finance and banking
• Manufacturing: quality control, fault diagnosis
• Medicine: analysis of electrocardiogram data, RNA & DNA sequencing, drug development without animal testing
• Pattern/image recognition: handwriting recognition, airport bomb detection
• Optimization: without using the Simplex method
• Control: process control, robotics
Comparison of Brains and Traditional Computers
• Processing elements: 200 billion neurons, 32 trillion synapses (brain) vs. 1 billion bytes of RAM but trillions of bytes on disk (computer)
• Element size: 10^-6 m (brain) vs. 10^-9 m (computer)
• Energy use: 25 W (brain) vs. 30-90 W for a CPU
• Processing speed: 100 Hz (brain) vs. 10^9 Hz (computer)
• Organization: parallel, distributed (brain) vs. serial, centralized (computer)
• Fault tolerance: fault tolerant (brain) vs. generally not fault tolerant (computer)
• Learns: yes (brain) vs. some (computer)
• Intelligent/conscious: usually (brain) vs. generally no (computer)
Biological Inspiration
Idea: to make the computer more robust, intelligent, and able to learn, let's model our computer software (and/or hardware) after the brain.
"My brain: it's my second favorite organ."
  - Woody Allen, from the movie Sleeper
Neurons in the Brain
• Although heterogeneous, at a low level the brain is composed of neurons
  – A neuron receives input from other neurons (generally thousands) through its synapses
  – Inputs are approximately summed
  – When the input exceeds a threshold, the neuron sends an electrical spike that travels from the cell body, down the axon, to the next neuron(s)
Biological Neuron
• 3 major functional units
  – Dendrites
  – Cell body
  – Axon
• Synapse
• The amount of signal passing through a neuron depends on:
  – Intensity of the signal from feeding neurons
  – Their synaptic strengths
  – Threshold of the receiving neuron
• Hebb rule (plays a key part in learning)
  – A synapse which repeatedly triggers the activation of a postsynaptic neuron will grow in strength; others will gradually weaken
• Learning occurs by adjusting the magnitudes of the synapses' strengths
[Figure: schematic neuron with inputs x1, x2, …, xn, weights w1, w2, …, wn, summed input ξ, activation g(ξ), and output y]
Learning in the Brain
• Brains learn by:
  – Altering the strength of connections between neurons
  – Creating/deleting connections
• Hebb's Postulate (Hebbian learning)
  – "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." (a minimal update-rule sketch follows this slide)
• Long-Term Potentiation (LTP)
  – A cellular basis for learning and memory
  – LTP is the long-lasting strengthening of the connection between two nerve cells in response to stimulation
  – Discovered in many regions of the cortex
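Hebb's postulate is often summarized as a weight-update rule in which a connection is strengthened whenever pre- and post-synaptic activity coincide. The following is a minimal sketch, assuming a rate-based formulation with a learning-rate parameter eta and a toy thresholded unit (these choices are mine, not from the slides):

```python
import numpy as np

def hebbian_update(w, x, y, eta=0.01):
    """One Hebbian step: delta w_i = eta * x_i * y, so weights grow where
    pre-synaptic activity x_i and post-synaptic activity y coincide."""
    return w + eta * x * y

# Toy usage: a single post-synaptic unit driven by 3 inputs
w = np.array([0.4, 0.1, 0.3])     # initial synaptic strengths
x = np.array([1.0, 0.0, 1.0])     # pre-synaptic activities
y = float(np.dot(w, x) > 0.5)     # post-synaptic activity (thresholded sum = 0.7 -> 1.0)
w = hebbian_update(w, x, y)       # strengthens w[0] and w[2], leaves w[1] unchanged
```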
Artificial Neurons
(basic computational entities of an ANN)
• Analogy between artificial and biological neurons (connection weights represent synapses)
• In 1958 Rosenblatt introduced the mechanics (perceptron)
• Input to output: y = g(∑i wi xi) (see the sketch below)
• Only when the sum exceeds the threshold limit will the neuron fire
• Weights can enhance or inhibit
• The collective behaviour of neurons is what is interesting for intelligent data processing
[Figure: artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, summation ∑ w·x, activation g(·), and output y]
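As a concrete illustration of the y = g(∑i wi xi) mapping, here is a minimal sketch of a single artificial neuron with a step (threshold) activation; the particular weight and threshold values are arbitrary choices for the example, not from the slides:

```python
import numpy as np

def step(a, threshold=0.0):
    """Fire (output 1) only when the summed input exceeds the threshold."""
    return 1.0 if a > threshold else 0.0

def neuron(x, w, threshold=0.0):
    """y = g(sum_i w_i * x_i): weighted sum followed by a threshold activation."""
    return step(np.dot(w, x), threshold)

x = np.array([1.0, 0.0, 1.0])       # inputs x1..x3
w = np.array([0.5, -0.3, 0.8])      # excitatory (+) and inhibitory (-) weights
print(neuron(x, w, threshold=1.0))  # weighted sum 1.3 > 1.0, so the neuron fires -> 1.0
```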
Model of a Neuron
Activation Function
[Figure: two activation functions f(A) plotted over A, each ranging from -1 to +1: a step function and a sigmoid function (both are sketched in code below)]
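A short sketch of the two activations in the figure; the bipolar (-1 to +1) output range follows the plots, while the sigmoid slope parameter is my own illustrative choice:

```python
import numpy as np

def step(a):
    """Bipolar step: +1 if the activation is positive, otherwise -1."""
    return np.where(a > 0, 1.0, -1.0)

def sigmoid(a, slope=1.0):
    """Bipolar sigmoid (tanh-shaped): smooth transition from -1 to +1."""
    return 2.0 / (1.0 + np.exp(-slope * a)) - 1.0

a = np.linspace(-5, 5, 5)   # sample activations: -5, -2.5, 0, 2.5, 5
print(step(a))              # [-1. -1. -1.  1.  1.]
print(sigmoid(a))           # smooth values between -1 and +1
```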
Perceptrons
• Can be trained on a set of examples using a special learning rule (process)
• Weights are changed in proportion to the difference (error) between the target output and the perceptron's output for each example
  – Minimize the summed square error function with respect to the weights:
    E = 1/2 ∑p ∑i (oi(p) - ti(p))^2
• The error is a function of all the weights and forms an irregular, multidimensional error surface with many peaks, saddle points and minima
• The error is minimized by finding the set of weights that corresponds to the global minimum
• This is done with the gradient descent method: weights are incrementally updated in proportion to ∂E/∂wij
• The update reads: wij(t + 1) = wij(t) + Δwij
• The aim is to produce a true mapping for all patterns
[Figure: perceptron with inputs xj, weights wij, summed input ξ, threshold, activation g(ξ), and outputs oi]
Perceptron Structure
Learning for Perceptron
1. Initialize wij with random values
2. Repeat until wij(t + 1) ≈ wij(t):
   • Pick a pattern p from the training set
   • Feed the input to the network and calculate the output
   • Update the weights according to
     wij(t + 1) = wij(t) + Δwij, where Δwij = -η ∂E/∂wij
   – When no change (within some accuracy) occurs, the weights are frozen and the network is ready to use on data it has never seen (a code sketch of this loop follows below)
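The procedure above can be written in a few lines. This is a minimal delta-rule sketch under my own choices of learning rate, stopping tolerance, and a bias (threshold) weight w0 handled by appending a constant input of 1; it is an illustration, not the slides' exact algorithm:

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100, tol=1e-6):
    """Gradient-descent (delta-rule) training of a single-output perceptron.
    X: (n_patterns, n_inputs) inputs; t: (n_patterns,) targets in {0, 1}."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend bias input x0 = 1
    w = np.random.uniform(-0.5, 0.5, Xb.shape[1])   # step 1: random initial weights
    for _ in range(max_epochs):
        w_old = w.copy()
        for x, target in zip(Xb, t):                # step 2: loop over patterns p
            o = 1.0 if np.dot(w, x) > 0 else 0.0    # feed input, compute output
            w += eta * (target - o) * x             # delta_w proportional to -dE/dw
        if np.max(np.abs(w - w_old)) < tol:         # weights frozen: stop
            break
    return w

# Usage: learn the AND mapping shown on the next slide
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)
t_and = np.array([1, 0, 0, 0], dtype=float)
w = train_perceptron(X, t_and)
```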
Example
AND:              OR:
x1  x2  t         x1  x2  t
 1   1  1          1   1  1
 1   0  0          1   0  1
 0   1  0          0   1  1
 0   0  0          0   0  0

The perceptron learns these rules easily (i.e., sets appropriate weights and threshold):
w = (w0, w1, w2) = (-1.5, 1.0, 1.0) for AND and (-0.5, 1.0, 1.0) for OR, where w0 corresponds to the threshold term (verified in the sketch below).
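A quick check that the quoted weights reproduce both truth tables; the check itself is a small sketch, while the weight values and the bias-as-w0 convention come from the slide:

```python
import numpy as np

def perceptron_output(w, x1, x2):
    """Threshold unit with bias weight w0: fires when w0 + w1*x1 + w2*x2 > 0."""
    return 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0

w_and = np.array([-1.5, 1.0, 1.0])
w_or  = np.array([-0.5, 1.0, 1.0])
for x1, x2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(x1, x2, perceptron_output(w_and, x1, x2), perceptron_output(w_or, x1, x2))
# Prints the AND column and the OR column of the truth tables above.
```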
Problem & Solution
• Perceptrons can only perform accurately with linearly separable classes
  – a linear hyperplane can place one class of objects on one side of the plane and the other class on the other side
• ANN research was put on hold for 20 years
• Solution: additional (hidden) layers of neurons, the MLP architecture
  – Able to solve non-linear classification problems
[Figure: two plots in the (x1, x2) plane, a linearly separable case and a non-linearly separable case]
Multilayer Perceptrons (MLPs)
• The learning procedure is an extension of the simple perceptron algorithm
• Response function: oi = g(∑j wij g(∑k wjk xk)), which is non-linear, so the network is able to perform non-linear mappings (see the forward-pass sketch below)
• Theory tells us that a neural network with at least 1 hidden layer can represent any function
• A vast number of ANN types exist
[Figure: MLP with inputs xk, hidden units hj, weights wjk and wij, and outputs oi]
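A minimal sketch of the two-layer response function oi = g(∑j wij g(∑k wjk xk)); the logistic activation and the random weight shapes are illustrative assumptions:

```python
import numpy as np

def g(a):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W_jk, W_ij):
    """Two-layer MLP: hidden h_j = g(sum_k w_jk x_k), output o_i = g(sum_j w_ij h_j)."""
    h = g(W_jk @ x)   # hidden layer activations
    o = g(W_ij @ h)   # output layer activations
    return o, h

rng = np.random.default_rng(0)
x = np.array([0.2, 0.7, 0.1])      # inputs x_k
W_jk = rng.normal(size=(4, 3))     # input-to-hidden weights w_jk
W_ij = rng.normal(size=(2, 4))     # hidden-to-output weights w_ij
o, h = mlp_forward(x, W_jk, W_ij)
```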
MLP Structure
Geometric Interpretation of Perceptron Learning
Backpropagation ANNs
• Most widely used type of network
• Feedforward
• Supervised (learns a mapping from one data space to another using examples)
• Error is propagated backwards
• Versatile: used for data modelling, classification, forecasting, data and image compression, and pattern recognition
BP Learning Algorithm
• Like the perceptron, uses gradient descent to minimize the error (generalized to the case with hidden layers)
• Each iteration constitutes two sweeps: a forward pass and a backward pass
• To minimize the error we need ∂E/∂wij, but we also need ∂E/∂wjk (which we get using the chain rule)
• Training an MLP using BP can be thought of as a walk in weight space along an energy surface, trying to find the global minimum and avoid local minima
• Unlike for the perceptron, there is no guarantee that the global minimum will be reached, but in most cases the energy landscape is smooth
Backpropagation Net Structure
BP Learning Algorithm
1. Initialize wij and wjk with random values
2. Repeat until wij and wjk have converged or the desired performance level is reached:
   • Pick a pattern p from the training set
   • Present the input and calculate the output
   • Update the weights according to:
     wij(t + 1) = wij(t) + Δwij
     wjk(t + 1) = wjk(t) + Δwjk
     where Δw = -η ∂E/∂w (a one-step sketch follows below)
   (…etc. for extra hidden layers)
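The two-sweep update can be sketched for a single hidden layer as below; the logistic activation, the squared-error loss, and the variable names are my own illustrative choices rather than the slides' notation:

```python
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))   # logistic activation

def bp_step(x, t, W_jk, W_ij, eta=0.5):
    """One backpropagation update for a 1-hidden-layer MLP with squared error."""
    # forward sweep
    h = g(W_jk @ x)                             # hidden activations h_j
    o = g(W_ij @ h)                             # outputs o_i
    # backward sweep (chain rule)
    delta_o = (o - t) * o * (1 - o)             # dE/da_i at the output layer
    delta_h = (W_ij.T @ delta_o) * h * (1 - h)  # dE/da_j at the hidden layer
    W_ij -= eta * np.outer(delta_o, h)          # delta_w = -eta * dE/dw_ij
    W_jk -= eta * np.outer(delta_h, x)          # delta_w = -eta * dE/dw_jk
    return W_jk, W_ij
```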
Training
• Generalization: the network's performance on a set of test patterns it has never seen before (lower than on the training set)
• The training set is used to let the ANN capture the features in the data or the mapping
• The initial large drop in error is due to learning, but the subsequent slow reduction is due to:
  1. Network memorization (too many training cycles used)
  2. Overfitting (too many hidden nodes)
  (the network learns individual training examples and loses its generalization ability)
[Figure: training and testing error (e.g. SSE) versus number of hidden nodes or training cycles; the optimum network lies where the testing error is lowest]
Other Popular ANNs
Some applications may be solved using a variety of ANN types, some only via a specific one (depending on the problem logistics).
• Hopfield networks: optimization
  Presented with an incomplete/noisy pattern, the network responds by retrieving the internally stored pattern it most closely resembles
• Kohonen networks (self-organizing)
  Trained in an unsupervised manner to form clusters in the data. Used for pattern classification and data compression
Summary of ANN Learning
Artificial Neural Networks
• Feedforward
  – Supervised: MLP, RBF
  – Unsupervised: Kohonen, Hebbian
• Recurrent
  – Supervised: Elman, Jordan, Hopfield
  – Unsupervised: ART
Hopfield Network: Structure and Update Rule
• Constraints
  – wij = wji (symmetric weights)
  – wii = 0 for all i (no self-connections)
  – Ii, Oi ∈ {-1, +1}
• Update rule
  NETj = ∑i wij Oi + Ij
  Oj = +1 if NETj > Tj;  Oj unchanged if NETj = Tj;  Oj = -1 if NETj < Tj
[Figure: Hopfield network structure]
Hopfield Network: Characteristics and Purpose
• Characteristics
  – A recurrent network with feedback
  – A dynamic network
• Purpose
  – Output the stored pattern closest to the input
  – Application areas
    • Associative memory
    • Optimization
Example (1)
• Problem: store two 4-dimensional bipolar pattern vectors x1 and x2
• Learning: the connection weights are obtained from w = x1 x1^T + x2 x2^T
[The slide shows the resulting 4x4 weight matrix; consistent with the constraint wii = 0, its diagonal is zero and the remaining entries are 0 and ±2]
Example (2)
• Association experiment
  1. Recovery of the training data: the weight matrix is multiplied by the stored pattern x1, and hard limiting the result reproduces x1
[The slide works through the matrix-vector product w·x1 and the recovered pattern]
Example (3)
• Recovery from incomplete data: an incomplete version of a stored pattern is multiplied by the weight matrix and passed through the hard-limiting function, recovering the complete stored pattern (see the code sketch below)
[The slide shows the computation w·x and the hard-limiting function fh(x), which outputs +1 for positive inputs and -1 for negative inputs]
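A compact sketch of the storage and recall steps illustrated in these examples, assuming thresholds Tj = 0 and asynchronous updates; the two stored patterns and the corrupted input below are my own illustrative choices, not necessarily the ones on the slides:

```python
import numpy as np

def hopfield_weights(patterns):
    """Hebbian storage: W = sum_p x_p x_p^T with zero diagonal (w_ii = 0)."""
    W = sum(np.outer(p, p) for p in patterns)
    np.fill_diagonal(W, 0)
    return W

def recall(W, x, sweeps=5):
    """Asynchronous recall: units updated one at a time with hard limiting,
    left unchanged when the net input equals the (zero) threshold."""
    x = x.copy()
    for _ in range(sweeps):
        for j in range(len(x)):
            net = W[j] @ x
            if net != 0:
                x[j] = 1 if net > 0 else -1
    return x

patterns = np.array([[1, 1, -1, -1],
                     [1, -1, 1, -1]])   # two illustrative bipolar patterns
W = hopfield_weights(patterns)
noisy = np.array([-1, 1, -1, -1])       # first pattern with one element flipped
print(recall(W, noisy))                 # recovers the first stored pattern [1, 1, -1, -1]
```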
Practical Considerations and Limitations
• The patterns to be associated should have low similarity to one another
• Network capacity: about 15% of the number of nodes
  – Example: for 10 patterns, more than 70 nodes are needed
  – This requires more than 5,000 connections
Boltzmann Machine
• Simulated annealing
  – At temperature T, the output value is determined stochastically by the Boltzmann distribution
  – Uses a carefully designed annealing schedule
• Boltzmann distribution (sketched below)
  P(Ei) ∝ e^(-Ei / T)
• Characteristics
  – A neural network that operates stochastically, e.g. via simulated annealing
  – A network capable of global optimization
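A minimal sketch of a stochastic unit update driven by the Boltzmann distribution, combined with a simple geometric annealing schedule; the schedule parameters, the binary {0, 1} states, and the toy weight matrix are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_update(W, s, T):
    """Stochastically set one unit: P(s_j = 1) follows the Boltzmann distribution
    for the energy decrease delta_E_j = sum_i w_ij s_i of turning unit j on."""
    j = rng.integers(len(s))
    delta_E = W[j] @ s
    p_on = 1.0 / (1.0 + np.exp(-delta_E / T))
    s[j] = 1 if rng.random() < p_on else 0
    return s

def anneal(W, s, T0=10.0, alpha=0.9, steps_per_T=50, T_min=0.1):
    """Lower the temperature gradually so the network settles near a global optimum."""
    T = T0
    while T > T_min:
        for _ in range(steps_per_T):
            s = boltzmann_update(W, s, T)
        T *= alpha                       # geometric annealing schedule
    return s

W = np.array([[0., 2., -1.],
              [2., 0., -1.],
              [-1., -1., 0.]])           # symmetric weights, zero diagonal
s = rng.integers(0, 2, size=3)           # random initial binary state
print(anneal(W, s))
```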
Energy Curve
[Figure: energy curve]
Self-Organizing Map
• Self-organizing map (SOM)
  – Unsupervised learning
  – Preserves the topology of the data
• Widely used in data visualization and topology-preserving mapping
• Selection of the winner c:
  ||x - mc|| = min_i ||x - mi||
• Weight update (see the sketch below):
  mi(t + 1) = mi(t) + α(t) nci(t) [x(t) - mi(t)]
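A minimal sketch of the winner selection and weight update above, using a Gaussian neighborhood on a small 2-D grid; the grid size, neighborhood width, and learning-rate schedule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.array([(r, c) for r in range(4) for c in range(4)])  # 4x4 map positions
M = rng.random((16, 3))                                        # weight vectors m_i

def som_step(x, M, t, alpha0=0.5, sigma0=2.0, tau=200.0):
    """One SOM update: find the winner c, then pull its neighbors toward x."""
    c = np.argmin(np.linalg.norm(M - x, axis=1))   # ||x - m_c|| = min_i ||x - m_i||
    alpha = alpha0 * np.exp(-t / tau)              # decaying learning rate alpha(t)
    sigma = sigma0 * np.exp(-t / tau)              # shrinking neighborhood radius
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)     # squared grid distance to the winner
    n_ci = np.exp(-d2 / (2 * sigma ** 2))          # neighborhood function n_ci(t)
    M += alpha * n_ci[:, None] * (x - M)           # m_i(t+1) = m_i(t) + alpha*n_ci*(x - m_i)
    return M

for t in range(1000):
    x = rng.random(3)                              # random training sample
    M = som_step(x, M, t)
```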
SOM Structure
SASOM
1. Start with a basic SOM (4x4 map)
2. Train the current network with Kohonen's algorithm
3. Calibrate the network using known I/O patterns to determine whether:
   – a node should be replaced with a submap of several nodes (2x2 map)
   – a node should be deleted
4. Unless every node represents a unique class, go to step 2
Learning Procedure
1. Input data
2. Initialize the map as 4x4
3. Train with Kohonen's algorithm
4. Structure adaptation:
   – Find nodes whose hit_ratio is less than 95.0%
   – Split those nodes into 2x2 submaps
   – Train the split nodes with the LVQ algorithm
   – Remove nodes that did not participate in learning
5. If the stop condition is satisfied, the map is generated; otherwise repeat from step 3
Kohonen’s Learning
• Initialization
  – 4x4 rectangular map trained using Kohonen's learning algorithm
• Learning
  – Winner node c: ||x - mc|| = min_i ||x - mi||
  – Kohonen's learning rule: mi(t + 1) = mi(t) + α(t) nci(t) [x(t) - mi(t)]
    where nci(t) is the neighborhood function and α(t) is the learning rate
Dynamic Node-Splitting
• Determining whether a node is to be split or not
  – Hit ratio (see the sketch below):
    hit_ratio_i = max_j P(cj | ni), where i = 1, 2, …, M and j = 1, 2, …, N
  – Nodes with a hit ratio of less than 95.0% are split
[Figure: example maps showing the class labels (0/1) hitting each node before and after splitting]
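A small sketch of the hit-ratio test that decides which nodes to split; the count-matrix layout is an illustrative assumption, while the purity definition and the 95.0% threshold follow the slide:

```python
import numpy as np

def hit_ratios(counts):
    """counts[i, j] = number of known samples of class c_j mapped to node n_i.
    Returns hit_ratio_i = max_j P(c_j | n_i), the class purity of each node."""
    totals = np.maximum(counts.sum(axis=1, keepdims=True), 1)  # avoid division by zero
    return (counts / totals).max(axis=1)

# Toy usage: 3 nodes, 2 classes; only node 1 is impure and would be split
counts = np.array([[20,  0],
                   [12,  8],
                   [ 1, 30]])
ratios = hit_ratios(counts)
to_split = np.where(ratios < 0.95)[0]   # nodes below the 95.0% threshold
print(ratios, to_split)                 # [1.0, 0.6, ~0.97]; split node 1
```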
Initial Weight of Split Nodes
• Each child node's weight is initialized from its parent and the parent's neighbors:
  C = (2P + ∑ Nc) / S
  where C: child node weight, P: parent node weight, Nc: weights of the neighbors,
  S: total number of nodes that participate in the weight initialization
[Figure: a parent node P0 with neighbors P1-P4 being split into child nodes C0-C3]
LVQ Learning for Modified Map
  mi(t + 1) = mi(t) + α(t) nci(t) hci(t) [x(t) - mi(t)]
  hci(t) = 1 if x(t) and mi(t) belong to the same class
  hci(t) = 0 if x(t) and mi(t) belong to different classes
• The neighborhood function is used to preserve the topological order (see the sketch below)
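A brief sketch of the class-aware update above; it keeps hci = 0 for mismatched classes as written on the slide, and reuses the grid, weight matrix M, and Gaussian neighborhood assumptions from the SOM sketch earlier:

```python
import numpy as np

def lvq_step(x, x_class, M, node_class, grid, t, alpha0=0.3, sigma0=1.0, tau=200.0):
    """LVQ update for the modified map:
    m_i(t+1) = m_i(t) + alpha(t) * n_ci(t) * h_ci(t) * (x - m_i(t))."""
    c = np.argmin(np.linalg.norm(M - x, axis=1))   # winner node for sample x
    alpha = alpha0 * np.exp(-t / tau)              # decaying learning rate alpha(t)
    sigma = sigma0 * np.exp(-t / tau)
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)
    n_ci = np.exp(-d2 / (2 * sigma ** 2))          # neighborhood preserves topological order
    h_ci = (node_class == x_class).astype(float)   # 1 if same class as x, else 0
    M += alpha * (n_ci * h_ci)[:, None] * (x - M)
    return M

# node_class would hold one class label per map node, obtained from calibration.
```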
Homework #1
1. Explain the learning principle of the MLP based on Information Geometry, and survey methods for improving its learning performance.
2. Survey practical tips for applying MLPs to real-world problems, divided into network structure, learning algorithm, and training-data preprocessing.