Prediction Networks
• Prediction
– Predict f(t) based on values of f(t – 1), f(t – 2),…
– Two NN models: feedforward and recurrent
• A simple example (section 3.7.3)
– Forecasting the gold price for a month based on its prices in previous months
– Using a BP net with a single hidden layer
• 1 output node: the forecasted price for month t
• k input nodes (the prices of the previous k months are used for prediction)
• k hidden nodes
• Training samples, for k = 2: {(x_{t-2}, x_{t-1}); x_t} (see the preprocessing sketch below)
• Raw data: gold prices for 100 consecutive months; 90 for training, 10 for cross-validation testing
• One-lag forecasting: predict x_t based on the actual values x_{t-2} and x_{t-1}
• Multilag forecasting: feed predicted values back in for further forecasting
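To make the preprocessing step concrete, the following is a minimal Python sketch of how the (k inputs, 1 target) training samples could be assembled from a monthly price series; the placeholder series and the variable names are illustrative, not taken from the text.

```python
# Sketch: build (input, target) pairs for forecasting x[t] from the k previous values,
# assuming the price series is a plain list of floats.
def make_samples(prices, k):
    """Return pairs ((x[t-k], ..., x[t-1]), x[t]) for t = k .. len(prices) - 1."""
    return [(tuple(prices[t - k:t]), prices[t]) for t in range(k, len(prices))]

# Illustrative use with k = 2, mirroring the 90/10 train/test split in the text.
prices = [float(p) for p in range(100)]           # placeholder for 100 monthly gold prices
samples = make_samples(prices, k=2)
train, test = samples[:90 - 2], samples[90 - 2:]  # first 90 months train, last 10 test
print(len(train), len(test), train[0])
```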
Prediction Networks
• Training:
– Three attempts: k = 2, 4, 6
– Learning rate = 0.3, momentum = 0.6
– 25,000 – 50,000 epochs
– The 2-2-1 net gave good predictions
– The two larger nets were over-trained
Results (MSE)

Network   Training   one-lag   multilag
2-2-1     0.0034     0.0044    0.0045
4-4-1     0.0034     0.0098    0.0100
6-6-1     0.0028     0.0121    0.0176
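The distinction between one-lag and multilag forecasting can be sketched as below; `predict` is a trivial stand-in for the trained 2-2-1 net (not the model whose errors are reported in the table), and the series is illustrative.

```python
# Sketch: one-lag vs. multilag forecasting with a k = 2 predictor.
def predict(x_prev2, x_prev1):
    return 0.5 * (x_prev2 + x_prev1)        # placeholder for the trained 2-2-1 net

def one_lag(series, start):
    # Each forecast uses the *actual* previous two values.
    return [predict(series[t - 2], series[t - 1]) for t in range(start, len(series))]

def multilag(series, start):
    # Forecasts are fed back in place of actual values for later forecasts,
    # which is consistent with the larger multilag MSEs in the table.
    history = list(series[:start])
    forecasts = []
    for _ in range(start, len(series)):
        y = predict(history[-2], history[-1])
        forecasts.append(y)
        history.append(y)
    return forecasts

series = [1.0, 1.2, 1.1, 1.3, 1.4, 1.35, 1.5, 1.6, 1.55, 1.7]   # illustrative data
print(one_lag(series, 2))
print(multilag(series, 2))
```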
Prediction Networks
• Generic NN model for prediction
– A preprocessor prepares training samples $\bar{x}(t)$ from the time-series data x(t)
– The predictor is trained on the samples $\bar{x}(t)$ (e.g., by BP learning)
• Preprocessor
– In the previous example:
  • Let k = d + 1 (the previous d + 1 data points are used for prediction)
  • $\bar{x}(t) = (x_0(t), x_1(t), \dots, x_d(t))$, where $x_i(t) = x(t - i)$, $i = 0, \dots, d$
– More generally, each component $x_i(t)$ is a weighted combination of past inputs:
  • $c_i$ is called a kernel function; different kernels give different memory models (how previous data are remembered)
  • Examples: exponential trace memory; gamma memory (see p. 141 and the sketch below)
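A minimal sketch of a preprocessor built on an exponential trace memory, where each feature keeps a decayed running average of the past inputs, x_i(t) = (1 - mu_i)·x(t) + mu_i·x_i(t-1); the decay constants below are illustrative assumptions, not values from the text.

```python
# Sketch: exponential trace memory preprocessor.
# Each feature keeps a decayed trace of past inputs; mu = 0 reproduces the raw input,
# larger mu remembers further back.
def exponential_trace_features(series, mus):
    feats = [0.0] * len(mus)
    out = []
    for x in series:
        feats = [(1.0 - mu) * x + mu * f for mu, f in zip(mus, feats)]
        out.append(tuple(feats))
    return out

series = [1.0, 2.0, 3.0, 2.5, 2.0]                        # illustrative time series
for t, f in enumerate(exponential_trace_features(series, mus=(0.0, 0.5, 0.9))):
    print(t, [round(v, 3) for v in f])
```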
Prediction Networks
• Recurrent NN architecture
– Cycles in the net
• Output nodes with connections to hidden/input nodes
• Connections between nodes at the same layer
• Node may connect to itself
– Each node receives external input as well as input from other
nodes
– Each node may be affected by output of every other node
– With a given external input vector, the net often converges to an equilibrium state after a number of iterations (the output of every node stops changing); see the sketch below
• An alternative NN model for function approximation
– Fewer nodes, more flexible/complicated connections
– Learning is often more complicated
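A minimal sketch of relaxing a small recurrent net to an equilibrium state under a fixed external input; the 3-node size, the weight values, and the sigmoid node function are illustrative assumptions.

```python
import math

# Sketch: iterate a fully recurrent net until every node's output stops changing.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relax(W, external, tol=1e-6, max_iters=1000):
    """Repeat o_i <- sigmoid(ext_i + sum_j W[i][j] * o_j) until the outputs settle."""
    n = len(external)
    o = [0.0] * n
    for _ in range(max_iters):
        new_o = [sigmoid(external[i] + sum(W[i][j] * o[j] for j in range(n)))
                 for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_o, o)) < tol:
            return new_o                     # equilibrium reached
        o = new_o
    return o                                 # convergence is not guaranteed in general

W = [[0.0, 0.5, -0.3],                       # illustrative recurrent weights,
     [0.2, 0.0, 0.4],                        # including within-layer connections
     [-0.1, 0.3, 0.0]]
print([round(v, 4) for v in relax(W, external=[0.5, -0.2, 0.1])])
```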
Prediction Networks
• Approach I: unfolding to a feedforward net
– Each layer represents one time step (delay) of the network's evolution
– Weights in different layers are identical
– BP learning cannot be applied directly (because the weights in different layers are constrained to be identical)
– How many layers to unfold to? Hard to determine
(Figures: a fully connected net of 3 nodes, and the equivalent FF net of k layers)
Prediction Networks
• Approach II: gradient descent
– A more general approach
– Error driven: for a given external input,
  $E(t) = \sum_k (d_k(t) - o_k(t))^2 = \sum_k e_k(t)^2$,
  where k ranges over the output nodes (whose desired outputs are known)
– Weight update (see the sketch below): $\Delta w_{i,j}(t) = -\eta\, \dfrac{\partial E(t)}{\partial w_{i,j}}$
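A sketch of the error-driven update above; for brevity the gradient ∂E/∂w_{i,j} is estimated by finite differences on the equilibrium outputs rather than by the exact recurrent-gradient computation, and the network, desired output, and step sizes are illustrative assumptions.

```python
import copy, math

# Sketch: gradient descent on the weights of a small recurrent net,
# dw_ij = -eta * dE/dw_ij, with a finite-difference gradient estimate.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def equilibrium_output(W, external, iters=200):
    o = [0.0] * len(external)
    for _ in range(iters):
        o = [sigmoid(external[i] + sum(W[i][j] * o[j] for j in range(len(o))))
             for i in range(len(o))]
    return o

def error(W, external, desired):             # E = sum over output nodes of (d_k - o_k)^2
    o = equilibrium_output(W, external)
    return sum((d - o[k]) ** 2 for k, d in desired.items())

def gd_step(W, external, desired, eta=0.1, h=1e-5):
    new_W, base = copy.deepcopy(W), error(W, external, desired)
    for i in range(len(W)):
        for j in range(len(W)):
            W[i][j] += h                      # perturb one weight
            grad = (error(W, external, desired) - base) / h
            W[i][j] -= h                      # restore it
            new_W[i][j] -= eta * grad         # dw_ij = -eta * dE/dw_ij
    return new_W

W = [[0.0, 0.5, -0.3], [0.2, 0.0, 0.4], [-0.1, 0.3, 0.0]]   # illustrative weights
external, desired = [0.5, -0.2, 0.1], {2: 0.9}              # node 2 treated as the output node
for _ in range(20):
    W = gd_step(W, external, desired)
print(round(error(W, external, desired), 6))
```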
NN of Radial Basis Functions
• Motivation: better performance than sigmoid-node nets on
– some classification problems
– function interpolation
• Definition
– A function is radially symmetric (i.e., is an RBF) if its output depends only on the distance between the input vector and a stored vector associated with that function
  • Distance: $u = \| i - \mu \|$, where $i$ is the input vector and $\mu$ is the vector associated with the RBF
  • Output: $\phi(u_1) = \phi(u_2)$ whenever $u_1 = u_2$, i.e., equal distances give equal outputs
– NNs with RBF node functions are called RBF nets
NN of Radial Basis Functions
• Gaussian function is the most widely used RBF
– $g(u) = e^{-(u/c)^2}$: a bell-shaped function centered at u = 0
– Continuous and differentiable:
  if $g(u) = e^{-(u/c)^2}$, then $g'(u) = e^{-(u/c)^2}\,\bigl(-(u/c)^2\bigr)' = -\frac{2u}{c^2}\, g(u)$
• Other RBFs (see the sketch below):
– Inverse quadratic function: $\phi_2(u) = (c^2 + u^2)^{-\beta}$, for $\beta > 0$
– Hyperspheric function: $s(u) = 1$ if $u \le c$, and $0$ if $u > c$
(Plots: the Gaussian, inverse quadratic, and hyperspheric node functions, each centered at $\mu$)
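The three node functions above written out as a small sketch; the parameter values c and β are illustrative choices.

```python
import math

# Sketch of the RBF node functions; u is the distance from the input to the stored vector.
def gaussian(u, c=1.0):
    return math.exp(-(u / c) ** 2)

def inverse_quadratic(u, c=1.0, beta=1.0):
    return (c ** 2 + u ** 2) ** (-beta)       # requires beta > 0

def hyperspheric(u, c=1.0):
    return 1.0 if u <= c else 0.0

for u in (0.0, 0.5, 1.0, 2.0):
    print(u, round(gaussian(u), 4), round(inverse_quadratic(u), 4), hyperspheric(u))
```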
NN of Radial Basis Functions
• Pattern classification
– 4 or 5 sigmoid hidden nodes are required for a good classification
– Only 1 RBF node is required if its function can approximate the circle
(Figure: scatter of training samples, marked ×)
NN of Radial Basis Functions
• XOR problem
– 2-2-1 network
  • The 2 hidden nodes are RBFs:
    $\rho_1(x) = e^{-\|x - t_1\|^2}$, with center $t_1 = (1, 1)$
    $\rho_2(x) = e^{-\|x - t_2\|^2}$, with center $t_2 = (0, 0)$
  • The output node can be a step or sigmoid unit
– When an input x is applied
  • Each hidden node calculates the distance $\|x - t_j\|$ and then its output
  • All weights to the hidden nodes are set to 1
  • Weights to the output node are trained by LMS
  • $t_1$ and $t_2$ can also be trained
x        ρ1(x)    ρ2(x)
(0, 0)   0.1353   1
(0, 1)   0.3678   0.3678
(1, 0)   0.3678   0.3678
(1, 1)   1        0.1353
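A sketch that reproduces the hidden-node outputs in the table from the definitions above; the final comment records the standard observation about this construction, and the LMS-trained output weights are not included.

```python
import math

# Sketch: the two Gaussian RBF hidden nodes of the 2-2-1 XOR net.
def rho(x, t):
    d2 = sum((a - b) ** 2 for a, b in zip(x, t))   # squared distance ||x - t||^2
    return math.exp(-d2)

t1, t2 = (1, 1), (0, 0)
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(rho(x, t1), 4), round(rho(x, t2), 4))

# In (rho1, rho2) space, (0,1) and (1,0) map to the same point (0.3678, 0.3678),
# while (0,0) and (1,1) map to (0.1353, 1) and (1, 0.1353); a single linear
# threshold on rho1 + rho2 then separates the two XOR classes.
```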
NN of Radial Basis Functions
• Function interpolation
– Suppose you know $f(x_1)$ and $f(x_2)$; to approximate $f(x_0)$ (with $x_1 < x_0 < x_2$) by linear interpolation:
  $f(x_0) = f(x_1) + (f(x_2) - f(x_1))(x_0 - x_1)/(x_2 - x_1)$
– Let $D_1 = x_0 - x_1$ and $D_2 = x_2 - x_0$ be the distances of $x_0$ from $x_1$ and $x_2$; then
  $f(x_0) = [D_1^{-1} f(x_1) + D_2^{-1} f(x_2)] \,/\, [D_1^{-1} + D_2^{-1}]$,
  i.e., a sum of the known function values, weighted and normalized by inverse distances
– Generalized to interpolation from more than 2 known values of f (see the sketch below):
  $f(x_0) = \dfrac{D_1^{-1} f(x_1) + D_2^{-1} f(x_2) + \cdots + D_{P_0}^{-1} f(x_{P_0})}{D_1^{-1} + D_2^{-1} + \cdots + D_{P_0}^{-1}}$,
  where $P_0$ is the number of neighbors of $x_0$
  • Only those $f(x_i)$ with a small distance to $x_0$ are really useful
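A minimal sketch of this inverse-distance-weighted interpolation; the sample points used for the check are illustrative.

```python
# Sketch: interpolate f(x0) as a sum of known values weighted by inverse distances
# and normalized, exactly as in the formula above.
def interpolate(x0, known):
    """known: list of (x_i, f(x_i)) pairs; returns the weighted estimate of f(x0)."""
    num = den = 0.0
    for xi, fi in known:
        d = abs(x0 - xi)
        if d == 0:
            return fi                          # x0 coincides with a known sample
        num += fi / d
        den += 1.0 / d
    return num / den

# With two neighbors this reduces to ordinary linear interpolation:
print(interpolate(1.5, [(1.0, 10.0), (2.0, 20.0)]))            # 15.0
print(interpolate(1.5, [(1.0, 10.0), (2.0, 20.0), (4.0, 40.0)]))
```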
NN of Radial Basis Functions
• Example:
– 8 samples with known function values
– $f(x_0)$ can be interpolated using only its 4 nearest neighbors $(x_2, x_3, x_4, x_5)$:
  $f(x_0) = \dfrac{D_2^{-1} f(x_2) + D_3^{-1} f(x_3) + D_4^{-1} f(x_4) + D_5^{-1} f(x_5)}{D_2^{-1} + D_3^{-1} + D_4^{-1} + D_5^{-1}} = \dfrac{8 D_2^{-1} + 9 D_3^{-1} + 3 D_4^{-1} + 8 D_5^{-1}}{D_2^{-1} + D_3^{-1} + D_4^{-1} + D_5^{-1}}$
• Using RBF nodes to achieve this neighborhood effect
– One hidden node per sample, with node function $\phi(D) = D^{-1}$
– The network output for approximating $f(x_0)$ is then proportional to $\sum_p D_p^{-1} f(x_p)$
NN of Radial Basis Functions
• Clustering samples
– Too many hidden nodes are needed when the number of samples is large
– Group similar samples together into N clusters, each with
  • a center: the vector $\mu_i$
  • a desired mean output: $\omega_i$
– Network output: a weighted combination of the N RBF node outputs, one node centered at each $\mu_i$
– Suppose we know how to determine N and how to cluster all P samples (not an easy task in itself); then $\mu_i$ and $\omega_i$ can be determined by learning
NN of Radial Basis Functions
• Learning in an RBF net
– Objective: learn the centers $\mu_i$ and the output weights $\omega_i$ so as to minimize the squared error between the network outputs and the desired outputs over all training samples
– Gradient descent approach: update both $\mu_i$ and $\omega_i$ along the negative gradient of that error
– One can also obtain the $\mu_i$ by other clustering techniques, then use GD learning for the $\omega_i$ only (see the sketch below)
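A sketch of the hybrid scheme in the last bullet: centers μ_i from a bare-bones k-means pass, then gradient descent on the output weights only. The Gaussian node function, widths, learning rate, and toy data are all illustrative assumptions.

```python
import math, random

random.seed(0)

def kmeans_1d(xs, n, iters=20):
    """Very small k-means on scalar inputs; returns n cluster centers."""
    centers = random.sample(xs, n)
    for _ in range(iters):
        groups = [[] for _ in range(n)]
        for x in xs:
            groups[min(range(n), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def rbf_output(x, centers, weights, width=1.0):
    # Weighted sum of Gaussian RBF node outputs, one node per cluster center.
    return sum(w * math.exp(-((x - c) / width) ** 2) for c, w in zip(centers, weights))

xs = [i / 10.0 for i in range(50)]             # toy inputs
ys = [math.sin(x) for x in xs]                 # toy target values

centers = kmeans_1d(xs, n=6)                   # mu_i fixed by clustering
weights = [0.0] * len(centers)                 # output weights trained by GD only
eta = 0.05
for _ in range(500):
    for x, y in zip(xs, ys):
        err = y - rbf_output(x, centers, weights)
        for i, c in enumerate(centers):
            weights[i] += eta * err * math.exp(-((x - c) / 1.0) ** 2)

mse = sum((y - rbf_output(x, centers, weights)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(round(mse, 6))
```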
Polynomial Networks
• Polynomial networks
– Node functions allow direct computation of polynomials of the inputs
– Approximate higher-order functions with fewer nodes (even without hidden nodes)
– Each node has more connection weights
• Higher-order networks
– # of weights per node for order k with n inputs (one weight per monomial of the inputs up to degree k, repeated inputs allowed; see the sketch below):
  $1 + \binom{n}{1} + \binom{n+1}{2} + \cdots + \binom{n+k-1}{k}$
– Can be trained by LMS
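A sketch of a k-th order node that keeps one weight per monomial of the inputs (repeated inputs allowed); the weight values are illustrative, and the printed count matches the formula above for n = 3, k = 2.

```python
from itertools import combinations_with_replacement
from math import comb

# Sketch: enumerate all monomials of the inputs up to degree k (powers allowed)
# and apply one weight to each, as a higher-order node does.
def monomials(x, k):
    terms = [1.0]                                          # the bias term
    for degree in range(1, k + 1):
        for idx in combinations_with_replacement(range(len(x)), degree):
            prod = 1.0
            for i in idx:
                prod *= x[i]
            terms.append(prod)
    return terms

def higher_order_node(x, weights, k):
    return sum(w * t for w, t in zip(weights, monomials(x, k)))

n, k = 3, 2
count = 1 + sum(comb(n + j - 1, j) for j in range(1, k + 1))
print(count, len(monomials([0.5, -1.0, 2.0], k)))          # both 10 for n = 3, k = 2
weights = [0.1] * count                                    # illustrative weight values
print(higher_order_node([0.5, -1.0, 2.0], weights, k))
```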
Polynomial Networks
• Sigma-pi networks
– Do not allow terms with higher powers of the inputs, so they are not general function approximators
– # of weights per node: $1 + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{k}$
– Can be trained by LMS
• Pi-sigma networks
– One hidden layer of Sigma (weighted-sum) units
– Output nodes apply a Pi (product) function to the hidden-unit outputs (see the sketch below)
• Product units:
  • A node computes the product $\prod_i x_i^{p_{j,i}}$
  • The integer powers $p_{j,i}$ can be learned
  • Often mixed with other units (e.g., sigmoid)
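A sketch of a pi-sigma forward pass and of a product unit; the layer sizes, weights, and the sigmoid applied at the output are illustrative assumptions rather than prescriptions from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pi_sigma(x, W, thetas):
    """Hidden Sigma units h_j = sum_i W[j][i]*x_i + theta_j; output = sigmoid(prod_j h_j)."""
    hs = [sum(w * xi for w, xi in zip(row, x)) + th for row, th in zip(W, thetas)]
    prod = 1.0
    for h in hs:
        prod *= h
    return sigmoid(prod)

def product_unit(x, powers):
    """Product unit: output = prod_i x_i ** p_i, where the integer powers can be learned."""
    out = 1.0
    for xi, p in zip(x, powers):
        out *= xi ** p
    return out

x = [0.5, 2.0, 1.5]                                  # illustrative input
W = [[0.2, -0.1, 0.3], [0.4, 0.1, -0.2]]             # two Sigma hidden units
print(round(pi_sigma(x, W, thetas=[0.1, -0.1]), 4))
print(product_unit(x, powers=[2, 1, 0]))             # computes x1**2 * x2 = 0.5
```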