Neural Networks

Prediction Networks
• Prediction
– Predict f(t) based on values of f(t – 1), f(t – 2),…
– Two NN models: feedforward and recurrent
• A simple example (section 3.7.3)
– Forecasting the gold price for a month based on its prices in
previous months
– Using a BP net with a single hidden layer
• 1 output node: the forecasted price for month t
• k input nodes (using the prices of the previous k months for prediction)
• k hidden nodes
• Training samples, for k = 2: {((x_{t−2}, x_{t−1}), x_t)}
• Raw data: gold prices for 100 consecutive months; 90 for
training, 10 for cross-validation testing
• One-lag forecasting: predict x_t from the actual values x_{t−2} and x_{t−1};
multilag: use predicted values for further forecasting
(a data-preparation and training sketch follows below)
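To make the setup concrete, here is a minimal sketch (not the text's actual experiment): it builds the {((x_{t−2}, x_{t−1}), x_t)} samples from a synthetic price series, trains a 2-2-1 BP net with learning rate 0.3 (momentum is omitted for brevity), and reports the one-lag test MSE. The synthetic series, weight initialization, and epoch count are illustrative assumptions.

```python
# A minimal sketch: build {((x_{t-2}, x_{t-1}), x_t)} samples from a price series
# and fit a 2-2-1 BP net. The synthetic series and initialization are assumptions.
import numpy as np

rng = np.random.default_rng(0)
prices = np.sin(np.arange(100) * 0.3) * 0.4 + 0.5        # stand-in for 100 monthly prices

k = 2                                                    # use previous k months
X = np.array([prices[t - k:t] for t in range(k, len(prices))])
y = prices[k:]                                           # target: price for month t
X_tr, y_tr, X_te, y_te = X[:88], y[:88], X[88:], y[88:]  # hold out the last 10 targets

W1 = rng.normal(0, 0.5, (k, k))                          # k input -> k hidden
b1 = np.zeros(k)
W2 = rng.normal(0, 0.5, (k, 1))                          # k hidden -> 1 linear output
b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

eta = 0.3
for epoch in range(25000):                               # BP / batch gradient descent
    h = sigmoid(X_tr @ W1 + b1)                          # hidden activations
    o = (h @ W2 + b2).ravel()                            # one-lag prediction of x_t
    g_o = (o - y_tr)[:, None] / len(y_tr)                # dMSE/do (up to a factor 2)
    g_h = (g_o @ W2.T) * h * (1 - h)                     # backprop through the sigmoid
    W2 -= eta * h.T @ g_o
    b2 -= eta * g_o.sum(0)
    W1 -= eta * X_tr.T @ g_h
    b1 -= eta * g_h.sum(0)

test_pred = (sigmoid(X_te @ W1 + b1) @ W2 + b2).ravel()
print("one-lag test MSE:", np.mean((test_pred - y_te) ** 2))
```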
Prediction Networks
• Training:
– Three attempts: k = 2, 4, 6
– Learning rate = 0.3, momentum = 0.6
– 25,000 – 50,000 epochs
– The 2-2-1 net gave good predictions
– The two larger nets were over-trained
Results (MSE):

Network   Training   One-lag   Multilag
2-2-1     0.0034     0.0044    0.0045
4-4-1     0.0034     0.0098    0.0100
6-6-1     0.0028     0.0121    0.0176
Prediction Networks
• Generic NN model for prediction
– Preprocessor prepares training samples x̄(t) from the time-series data x(t)
– Train the predictor using samples x̄(t) (e.g., by BP learning)
• Preprocessor
– In the previous example,
• let k = d + 1 (use the previous d + 1 data points for prediction)
• x̄(t) = (x_0(t), x_1(t), ..., x_d(t)), where x_i(t) = x(t − i), i = 0, ..., d
– More generally:
• c_i is called a kernel function; different kernels give different memory
models (how previous data are remembered)
• Examples: exponential trace memory; gamma memory (see p. 141)
(a preprocessor sketch follows below)
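A minimal preprocessor sketch, assuming a plain tapped delay line x_i(t) = x(t − i) and one common form of exponential trace memory, x̄(t) = (1 − μ)·x(t) + μ·x̄(t − 1); the decay value μ = 0.7 and the toy series are assumptions, and the gamma memory is not shown.

```python
# A preprocessor sketch: a tapped delay line x_i(t) = x(t - i), plus one common
# form of exponential trace memory, trace(t) = (1 - mu)*x(t) + mu*trace(t - 1).
import numpy as np

def delay_line_samples(x, d):
    """Return rows (x(t), x(t-1), ..., x(t-d)) for every t >= d."""
    return np.array([[x[t - i] for i in range(d + 1)] for t in range(d, len(x))])

def exponential_trace(x, mu=0.7):
    """Each output summarizes the whole past with exponentially decaying weight."""
    trace = np.zeros(len(x))
    trace[0] = x[0]
    for t in range(1, len(x)):
        trace[t] = (1.0 - mu) * x[t] + mu * trace[t - 1]
    return trace

x = np.sin(np.arange(20) * 0.5)          # stand-in time series
print(delay_line_samples(x, d=2)[:3])    # samples (x_0(t), x_1(t), x_2(t))
print(exponential_trace(x)[:5])
```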
Prediction Networks
• Recurrent NN architecture
– Cycles in the net
• Output nodes with connections to hidden/input nodes
• Connections between nodes at the same layer
• Node may connect to itself
– Each node receives external input as well as input from other
nodes
– Each node may be affected by output of every other node
– With a given external input vector, the net often converges to an
equilibrium state after a number of iterations (the output of every node
stops changing); see the relaxation sketch below
• An alternative NN model for function approximation
– Fewer nodes, more flexible/complicated connections
– Learning is often more complicated
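A small relaxation sketch under assumed weights: every node receives its external input plus weighted outputs of the other nodes (and possibly itself), and the state is updated until no output changes appreciably, i.e., an equilibrium.

```python
# Iterate a small recurrent net to an equilibrium state. Weights and external
# inputs are made-up values for illustration.
import numpy as np

def relax(W, external, tol=1e-6, max_iters=1000):
    """Synchronously update o = sigmoid(W o + external) until it stops changing."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    o = np.zeros(len(external))
    for _ in range(max_iters):
        o_new = sigmoid(W @ o + external)
        if np.max(np.abs(o_new - o)) < tol:      # equilibrium: outputs stop changing
            return o_new
        o = o_new
    return o                                     # may not converge for every W

W = np.array([[0.0, 0.5, -0.3],                  # cycles: nodes feed each other,
              [0.4, 0.0,  0.2],                  # and a node could feed itself
              [-0.2, 0.6, 0.0]])
print(relax(W, external=np.array([0.1, -0.2, 0.3])))
```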
Prediction Networks
• Approach I: unfolding to a feedforward net
– Each layer represents a time delay (one step) of the network's evolution
– Weights in different layers are identical
– Cannot directly apply BP learning (because weights in different
layers are constrained to be identical)
– How many layers to unfold to? Hard to determine
(an unfolding sketch follows below)
[Figures: a fully connected net of 3 nodes; the equivalent FF net of k layers]
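A sketch of the unfolding idea, assuming the same 3-node net as in the relaxation sketch above: each unfolded layer applies the identical weight matrix W once, so k layers correspond to k time steps; the weight-sharing constraint across layers is what plain BP cannot handle directly.

```python
# Unfold a 3-node recurrent net into k feedforward layers that all share the
# same weight matrix W. k, W, and the input are illustrative assumptions.
import numpy as np

def unfolded_forward(W, external, k):
    """Return node outputs after each of k unfolded layers; every layer reuses W."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    o = np.zeros(len(external))
    history = [o]
    for _ in range(k):                       # each unfolded layer = one time delay
        o = sigmoid(W @ o + external)
        history.append(o)
    return history                           # BP through this stack must keep all
                                             # copies of W identical (weight sharing)

W = np.array([[0.0, 0.5, -0.3], [0.4, 0.0, 0.2], [-0.2, 0.6, 0.0]])
for t, o in enumerate(unfolded_forward(W, np.array([0.1, -0.2, 0.3]), k=5)):
    print(t, np.round(o, 4))
```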
Prediction Networks
• Approach II: gradient descent
– A more general approach
– Error driven: for a given external input,
E(t) = Σ_k (d_k(t) − o_k(t))² = Σ_k e_k(t)²
where k ranges over the output nodes (whose desired outputs are known)
– Weight update:
Δw_{i,j}(t) = −η ∂E(t)/∂w_{i,j}
(a numerical-gradient sketch follows below)
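A sketch of the gradient-descent update, using a numerical (finite-difference) estimate of ∂E(t)/∂w_{i,j} rather than an analytic recurrent-backpropagation derivation; the net, the single output node, the target d, and the learning rate are illustrative assumptions.

```python
# Gradient descent for a recurrent net: estimate dE/dw numerically and apply
# Delta w = -eta * dE/dw, with E = sum_k (d_k - o_k)^2 over the output nodes.
# A real implementation would compute the gradient analytically.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def equilibrium_output(W, external, iters=200):
    o = np.zeros(len(external))
    for _ in range(iters):                       # relax to (near) equilibrium
        o = sigmoid(W @ o + external)
    return o

def error(W, external, d, out_nodes):
    o = equilibrium_output(W, external)
    return np.sum((d - o[out_nodes]) ** 2)       # E(t) over output nodes only

W = np.array([[0.0, 0.5, -0.3], [0.4, 0.0, 0.2], [-0.2, 0.6, 0.0]])
external = np.array([0.1, -0.2, 0.3])
d, out_nodes, eta, eps = np.array([0.9]), [2], 0.5, 1e-6

for step in range(200):                          # gradient descent on every w_ij
    base = error(W, external, d, out_nodes)
    grad = np.zeros_like(W)
    for i in range(3):
        for j in range(3):
            Wp = W.copy()
            Wp[i, j] += eps
            grad[i, j] = (error(Wp, external, d, out_nodes) - base) / eps
    W -= eta * grad                              # Delta w_ij = -eta * dE/dw_ij
print("final E:", error(W, external, d, out_nodes))
```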
NN of Radial Basis Functions
• Motivations: better performance than sigmoid node functions on
– some classification problems
– function interpolation
• Definition
– A function is radially symmetric (i.e., is an RBF) if its output depends only
on the distance between the input vector and a vector stored with that function
• Distance u = ||i − μ||, where i is the input vector and μ is the vector
associated with the RBF
• Output: ρ(u_1) = ρ(u_2) whenever u_1 = u_2
– NNs with RBF node functions are called RBF nets
NN of Radial Basis Functions
• The Gaussian function is the most widely used RBF
– ρ_g(u) = e^{−(u/c)²}, a bell-shaped function centered at u = 0
– Continuous and differentiable:
if ρ_g(u) = e^{−(u/c)²}, then ρ'_g(u) = e^{−(u/c)²} · (−(u/c)²)' = −(2u/c²) ρ_g(u)
• Other RBFs: inverse quadratic function, hyperspheric function, etc.
– Inverse quadratic function: ρ_q(u) = (c² + u²)^{−β}, for β > 0
– Hyperspheric function: ρ_s(u) = 1 if u ≤ c, 0 if u > c
(implementations of all three are sketched below)
[Figures: the Gaussian, inverse quadratic, and hyperspheric functions, each centered at μ]
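Minimal implementations of the three node functions above, as functions of the distance u; the width c and exponent β values are arbitrary choices.

```python
# Sketches of the three RBF node functions, as functions of u = ||input - mu||.
import numpy as np

def gaussian_rbf(u, c=1.0):
    return np.exp(-(u / c) ** 2)                  # bell-shaped, maximum at u = 0

def inverse_quadratic_rbf(u, c=1.0, beta=1.0):
    return (c ** 2 + u ** 2) ** (-beta)           # decays polynomially, beta > 0

def hyperspheric_rbf(u, c=1.0):
    return np.where(u <= c, 1.0, 0.0)             # 1 within radius c, 0 outside

u = np.linspace(0.0, 3.0, 7)
for rho in (gaussian_rbf, inverse_quadratic_rbf, hyperspheric_rbf):
    print(rho.__name__, np.round(rho(u), 4))
```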
NN of Radial Basis Functions
• Pattern classification
– 4 or 5 sigmoid hidden nodes are required for a good
classification
– Only 1 RBF node is required if the function can approximate
the circle (a one-node sketch follows below)
[Figure: scatter of training points, with one class clustered inside a circular region]
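A sketch of the idea, assuming one class is clustered around a center μ: a single Gaussian RBF node thresholded on its output separates inside from outside. The center, spread, and threshold are made-up values.

```python
# One Gaussian RBF node (plus a threshold) separating a circular class region.
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 0.0])                                   # assumed class center
inside = rng.normal(0, 0.4, (20, 2))                        # class 1: near the center
outside = rng.normal(0, 0.4, (20, 2)) + np.array([2.5, 0])  # class 2: away from it

def rbf_classify(x, mu, c=1.0, threshold=0.3):
    u = np.linalg.norm(x - mu, axis=1)                      # distance to stored vector
    return np.exp(-(u / c) ** 2) > threshold                # fire only if close enough

print("inside  classified as class 1:", rbf_classify(inside, mu).mean())
print("outside classified as class 1:", rbf_classify(outside, mu).mean())
```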
NN of Radial Basis Functions
• XOR problem
– 2-2-1 network
• The 2 hidden nodes are RBFs:
ρ_1(x) = e^{−||x − t_1||²}, t_1 = [1, 1]
ρ_2(x) = e^{−||x − t_2||²}, t_2 = [0, 0]
• Output node can be step or sigmoid
– When input x is applied
• Hidden node j calculates the distance ||x − t_j||, then its output ρ_j(x)
(see the table and sketch below)
• All weights to the hidden nodes are set to 1
• Weights to the output node are trained by LMS
• t_1 and t_2 can also be trained
x        ρ_1(x)    ρ_2(x)
(0, 0)   0.1353    1
(0, 1)   0.3678    0.3678
(1, 0)   0.3678    0.3678
(1, 1)   1         0.1353
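The values in the table can be reproduced directly (the table truncates e^{−1} ≈ 0.36788 to 0.3678). The last line also checks one hand-picked linear threshold in the (ρ_1, ρ_2) space, standing in for the LMS-trained output weights, to show the transformed problem is linearly separable.

```python
# Reproduce the RBF hidden-node outputs for the four XOR inputs.
import numpy as np

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])
rho = lambda x, t: np.exp(-np.sum((x - t) ** 2))       # e^{-||x - t||^2}

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
for p in inputs:
    x = np.array(p, dtype=float)
    print(p, round(rho(x, t1), 4), round(rho(x, t2), 4))

# A hand-picked linear rule in the transformed space (LMS would learn weights):
print("XOR:", [int(rho(np.array(p, float), t1) + rho(np.array(p, float), t2) < 1.0)
               for p in inputs])
```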
NN of Radial Basis Functions
• Function interpolation
– Suppose we know f(x_1) and f(x_2); to approximate f(x_0)
(x_1 < x_0 < x_2) by linear interpolation:
f(x_0) = f(x_1) + (f(x_2) − f(x_1))(x_0 − x_1)/(x_2 − x_1)
– Let D_1 = x_0 − x_1 and D_2 = x_2 − x_0 be the distances of x_0 from x_1 and x_2;
then f(x_0) = [D_1^{−1} f(x_1) + D_2^{−1} f(x_2)] / [D_1^{−1} + D_2^{−1}],
i.e., a sum of the function values, weighted by inverse distances and normalized
– Generalizes to interpolation from more than 2 known f values:
• f(x_0) = [D_1^{−1} f(x_1) + D_2^{−1} f(x_2) + ... + D_{P_0}^{−1} f(x_{P_0})]
/ [D_1^{−1} + D_2^{−1} + ... + D_{P_0}^{−1}]
where P_0 is the number of neighbors of x_0
• Only those f(x_i) with small distance to x_0 contribute significantly
(a code sketch follows below)
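A sketch of the distance-weighted formula; the sample points and values are made up, and the two-neighbor case is checked against plain linear interpolation.

```python
# Distance-weighted interpolation: f(x0) ~ sum_i D_i^{-1} f(x_i) / sum_i D_i^{-1}.
import numpy as np

def interpolate(x0, xs, fs):
    """Inverse-distance-weighted estimate of f(x0) from known values fs at xs."""
    d = np.abs(np.asarray(xs, float) - x0)
    if np.any(d == 0):                        # x0 coincides with a known sample
        return float(fs[int(np.argmin(d))])
    w = 1.0 / d                               # weights D_i^{-1}
    return float(np.sum(w * fs) / np.sum(w))

xs, fs = [1.0, 2.0], np.array([10.0, 20.0])
print(interpolate(1.25, xs, fs))              # with two neighbors this equals
print(10 + (20 - 10) * (1.25 - 1) / (2 - 1))  # the linear interpolation: 12.5
```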
NN of Radial Basis Functions
• Example:
– 8 samples with known function values
– f(x_0) can be interpolated using only its 4 nearest neighbors (x_2, x_3, x_4, x_5):
f(x_0) = [D_2^{−1} f(x_2) + D_3^{−1} f(x_3) + D_4^{−1} f(x_4) + D_5^{−1} f(x_5)]
/ [D_2^{−1} + D_3^{−1} + D_4^{−1} + D_5^{−1}]
= [8 D_2^{−1} + 9 D_3^{−1} + 3 D_4^{−1} + 8 D_5^{−1}]
/ [D_2^{−1} + D_3^{−1} + D_4^{−1} + D_5^{−1}]
• Using RBF nodes to achieve this neighborhood effect
– One hidden node per sample, with node function ρ(D) = D^{−1}
– The network output for approximating f(x_0) is then proportional to
Σ_i D_i^{−1} f(x_i), the unnormalized interpolation sum
NN of Radial Basis Functions
• Clustering samples
– Too many hidden nodes when the number of samples is large
– Group similar samples into N clusters, each with
• a center: vector μ_i
• a desired mean output: ω_i
• Network output: combines the N cluster RBF node outputs ρ(||x − μ_i||)
• Suppose we know how to determine N and how to cluster all P samples
(not an easy task in itself); μ_i and ω_i can then be determined by learning
NN of Radial Basis Functions
• Learning in RBF nets
– Objective: learn the centers μ_i and output weights ω_i so as to minimize
the squared error between the network outputs and the desired outputs
over the training samples
– Gradient descent approach
– One can also obtain μ_i by other clustering techniques, then use
GD learning for ω_i only (a two-stage sketch follows below)
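A sketch of the two-stage recipe just described: obtain the centers μ_i by a clustering pass (plain k-means here), then run gradient descent on the output weights ω_i only. The target function, number of clusters N, RBF width c, and learning rate are illustrative assumptions.

```python
# Two-stage RBF-net training: cluster to get centers mu_i, then gradient descent
# on the output weights omega_i to minimize the squared error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))                 # P = 200 one-dimensional samples
y = np.sin(X[:, 0])                              # function values to approximate

def kmeans(X, N, iters=20):
    mu = X[rng.choice(len(X), N, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - mu[None], axis=2), axis=1)
        mu = np.array([X[labels == i].mean(0) if np.any(labels == i) else mu[i]
                       for i in range(N)])
    return mu

N, c, eta = 10, 1.0, 0.1
mu = kmeans(X, N)                                # cluster centers = hidden nodes
Phi = np.exp(-(np.linalg.norm(X[:, None] - mu[None], axis=2) / c) ** 2)
omega = np.zeros(N)

for epoch in range(2000):                        # GD on the output weights only
    err = Phi @ omega - y
    omega -= eta * Phi.T @ err / len(y)          # d(mean squared error)/d omega
print("training MSE:", np.mean((Phi @ omega - y) ** 2))
```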
Polynomial Networks
• Polynomial networks
– Node functions allow direct computing of polynomials
of inputs
– Approximating higher order functions with fewer nodes
(even without hidden nodes)
– Each node has more connection weights
• Higher-order networks
– # of weights per node: 1 + C(n, 1) + C(n+1, 2) + ... + C(n+k−1, k)
(one weight for each product of up to k inputs, with repeated inputs allowed)
– Can be trained by LMS
Polynomial Networks
• Sigma-pi networks
– Do not allow terms with higher powers of inputs, so they are not
general function approximators
– # of weights per node: 1 + C(n, 1) + C(n, 2) + ... + C(n, k)
– Can be trained by LMS
(a term-enumeration sketch follows below)
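To make the two counts concrete, the sketch below enumerates the product terms for n inputs up to order k, assuming the higher-order node has one weight per product with repeated inputs allowed (the count reconstructed above) versus distinct inputs only for the sigma-pi node; the extra 1 in each count is the bias weight. The values n = 3, k = 2 are arbitrary.

```python
# Enumerate product terms for a higher-order node (powers permitted) and a
# sigma-pi node (distinct inputs only), and compare with the closed-form counts.
from itertools import combinations, combinations_with_replacement
from math import comb

n, k = 3, 2
higher_order = [c for j in range(1, k + 1)
                for c in combinations_with_replacement(range(n), j)]
sigma_pi = [c for j in range(1, k + 1) for c in combinations(range(n), j)]

print("higher-order terms:", higher_order)     # includes squares such as (0, 0)
print("sigma-pi terms:   ", sigma_pi)          # distinct inputs only
print(1 + len(higher_order),                   # matches 1 + sum_j C(n+j-1, j)
      1 + sum(comb(n + j - 1, j) for j in range(1, k + 1)))
print(1 + len(sigma_pi),                       # matches 1 + sum_j C(n, j)
      1 + sum(comb(n, j) for j in range(1, k + 1)))
```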
• Pi-sigma networks
– One hidden layer of Sigma (weighted-sum) nodes
– Output nodes apply a Pi (product) function to the hidden-node sums
• Product units
– Node computes a product of its inputs raised to powers: Π_i x_i^{P_{j,i}}
– The integer powers P_{j,i} can be learned
– Often mixed with other units (e.g., sigmoid); a sketch follows below
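Minimal sketches of the two node types: a pi-sigma output as the product of the hidden weighted sums (any output nonlinearity is omitted), and a product unit computing Π_i x_i^{P_i}; the weights and powers are arbitrary example values.

```python
# Pi-sigma forward pass and a product unit, with made-up weights and powers.
import numpy as np

def pi_sigma_output(x, W, b):
    """Hidden sums net_j = W[j] . x + b[j]; output = product over j of net_j."""
    return np.prod(W @ x + b)

def product_unit(x, powers):
    """Node output = prod_i x_i ** P_i, with integer powers P_i."""
    return np.prod(np.asarray(x, float) ** np.asarray(powers))

x = np.array([0.5, 2.0, 1.5])
W = np.array([[0.2, -0.1, 0.4], [0.3, 0.5, -0.2]])   # 2 sigma hidden nodes
print(pi_sigma_output(x, W, b=np.zeros(2)))          # 0.5 * 0.85 = 0.425
print(product_unit(x, powers=[2, 1, -1]))            # 0.5^2 * 2.0 * 1.5^-1
```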