Neural Networks Preliminaries:
Notation: Write $x = (x_1, x_2, \ldots, x_d)$, and in integrals $dx = dx_1\, dx_2 \cdots dx_d$.
Recall:
Definition:
$L^p$ norm:
$$\|f(x)\|_p = \left( \int dx\, |f(x)|^p \right)^{1/p}.$$
$C(\mathbb{R}^d)$ = continuous functions on $\mathbb{R}^d$. Define a norm on $C(\mathbb{R}^d)$ by
$$\|f(x)\|_\infty = \sup |f(x)|.$$
A set of functions $S$ is dense in $L^p$ norm if every $f \in L^p$ can be arbitrarily well approximated by functions in $S$.
A set of functions $S$ is dense in $C(\mathbb{R}^d)$ norm if for every $f \in C(\mathbb{R}^d)$ there is a sequence of functions $f_i \in S$ such that $\|f_i - f\|_\infty \to 0$ as $i \to \infty$.
A function $f(x)$ defined on $\mathbb{R}^d$ is compactly supported if the set of $x \in \mathbb{R}^d$ for which $f(x) \neq 0$ is bounded (i.e., satisfies $\|x\|_2 = \big(\sum_i x_i^2\big)^{1/2} \leq M$ for some fixed $M > 0$).
A function $f(x)$ defined for $x \in \mathbb{R}$ is monotone increasing if $f(x)$ does not decrease as $x$ gets larger.
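(A quick numerical illustration of the $L^p$ and sup norms; the particular $f$ and the grid-based integration below are illustrative assumptions, not part of the notes.)

```python
import numpy as np

# Hypothetical example: f(x) = exp(-x^2) on a fine grid over [-5, 5].
x = np.linspace(-5.0, 5.0, 10001)
dx = x[1] - x[0]
f = np.exp(-x**2)

def lp_norm(values, p, dx):
    """Riemann-sum approximation of (integral |f|^p dx)^(1/p)."""
    return (np.sum(np.abs(values)**p) * dx) ** (1.0 / p)

print(lp_norm(f, 1, dx))   # ~ sqrt(pi)      ≈ 1.7725  (L^1 norm)
print(lp_norm(f, 2, dx))   # ~ (pi/2)^(1/4)  ≈ 1.1195  (L^2 norm)
print(np.max(np.abs(f)))   # sup norm = 1
```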
Neural Networks and Radial Basis Functions
1. Neural network theory
1. Since artificial intelligence (using von Neumann processors) has failed to produce true intelligence, we wish to work towards computational solutions to problems in intelligence.
2. Neural network theory has held that promise.
Existence proof: neural nets work in people, insects, etc.
Applications of neural nets: physics, biology, psychology, engineering, mathematics.
3. A basic component of many neural nets: feed-forward neural
networks.
The role of learning theory has grown a great deal in:
• Mathematics
• Statistics
• Computational Biology
• Neurosciences, e.g., theory of plasticity, workings of visual cortex
• Computer science, e.g., vision theory, graphics, speech synthesis
[Figure: brain image. Source: University of Washington, http://sig.biostr.washington.edu/projects/brain/images/3.16.21.gif]
[Figure. Source: T. Poggio/MIT]
Face identification: [Figure. Source: MIT]
People classification or detection: [Figure. Source: Poggio/MIT]
What is the theory behind such learning algorithms?
2. The problem: Learning theory
Given an unknown function $f(x): \mathbb{R}^d \to \mathbb{R}$, learn $f(x)$.
Example 1: $x$ is a retinal activation pattern (i.e., $x_i$ = activation level of retinal neuron $i$), and $y = f(x) > 0$ if the retinal pattern is a chair; $y = f(x) < 0$ otherwise.
[Thus: want concept of a chair]
Given: examples of chairs (and non-chairs): $x_1, x_2, \ldots, x_n$, together with the proper outputs $y_1, \ldots, y_n$. This is the information
$$Nf = (f(x_1), \ldots, f(x_n)).$$
Goal: Give the best possible estimate of the unknown function $f$, i.e., try to learn the concept $f$ from the examples $Nf$.
But: the given pointwise information about $f$ is not sufficient: which is the “right” $f(x)$ given the data $Nf$ below?
[figs. (a) and (b)]
[How to decide?]
3. The neural network approach to learning:
Feed-forward neural network:
fig. 58
Layer = vertical row of neurons
Neurons in first layer influence neurons in second layer.
Neurons in second layer influence neurons in third layer.
Etc.
First layer contains the “input”, i.e., we control the activations of its neurons.
Last layer contains the “output”, i.e., its activations provide the desired output that the neural network produces in response to the input in the first layer.
Funahashi and Hecht-Nielsen have shown that if we desire a
network which is able to take an arbitrary input pattern in the first
layer, and provide an arbitrary desired output pattern in the last
layer, all that is necessary is 3 layers:
fig. 59
Now consider only 3-layer networks.
$x_i$ = activation level (either chemical or electrical potential) of the $i$-th neuron in the first layer
$y_i$ = activation level of the $i$-th neuron in the second layer
$q_i$ = activation level of the $i$-th neuron in the third layer
$v_{ij}$ = strength of connection (weight) from the $j$-th neuron in layer 1 to the $i$-th neuron in layer 2
$w_{ij}$ = strength of connection (weight) from the $j$-th neuron in layer 2 to the $i$-th neuron in layer 3
Example: The first layer is the retina, and $x_i$ is the illumination level at neuron $i$. This is the input layer (light shines on the retina and activates it).
The last layer is the speech center (neurons ultimately connected to the mouth), and its pattern $q_i$ of neuron activations corresponds to the verbal description, about to be delivered, of what is seen in the first layer.
2. Neuron interaction rule
Neurons in one layer influence those in the next layer in an almost linear way:
$$y_i = H\!\left( \sum_{j=1}^{k} v_{ij} x_j + \theta_i \right),$$
i.e., the activation $y_i$ is a linear function of the activations $x_j$ in the previous layer, aside from the function $H$. Here $\theta_i$ = constant for each $i$.
The function $H$ is a sigmoid:
fig 60
Note that $H$ has an upper bound, so the response cannot exceed some constant.
Activation in the third layer:
$$q_i = \sum_{j=1}^{n} w_{ij} y_j$$
= a linear function of the $y_j$'s.
Goal: show we can get an arbitrary desired output pattern $q_i$ of activations on the last layer as a function of the inputs $x_i$ in the first layer.
Vector notation:
$x = (x_1, x_2, \ldots, x_k)^{T}$ = vector of neuron activations in layer 1.
$V_i = (v_{i1}, v_{i2}, \ldots, v_{ik})^{T}$ = vector of connection weights from the neurons in the first layer to the $i$-th neuron in the second layer.
Now: the activation $y_i$ in the second layer is
$$y_i = H\!\left( \sum_{j=1}^{k} v_{ij} x_j + \theta_i \right) = H(V_i \cdot x + \theta_i).$$
The activation $q_i$ in the third layer is
$$q_i = \sum_{j=1}^{n} w_{ij} y_j = W_i \cdot y.$$
Goal: to show that the pattern of activations
$$q = (q_1, q_2, \ldots, q_k)^{T}$$
on the last layer (output) can be made to be an arbitrary function of the pattern of activations
$$x = (x_1, x_2, \ldots, x_k)^{T}$$
on the first layer.
Notice: the activation of the $i$-th neuron in layer 3 is
$$q_i = \sum_{j=1}^{n} w_{ij} y_j = \sum_{j=1}^{n} w_{ij} H(V_j \cdot x + \theta_j). \qquad (1)$$
Thus the question is: if $q = f(x)$ is defined by (1) (i.e., input determines output through a neural network equation), is it possible to approximate any function in this form?
Ex. If the first layer = retina, then we can require that if $x$ = visual image of a chair (vector of pixel intensities corresponding to the chair), then $q$ = neural pattern of intensities corresponding to articulation of the words “this is a chair”.
Equivalently:
Given any function $f(x): \mathbb{R}^k \to \mathbb{R}^k$, can we approximate $f(x)$ by a function of the form (1) in
(a) $C(\mathbb{R}^k)$ norm?
(b) $L^1$ norm?
(c) $L^2$ norm?
Can show that these questions are equivalent to the case where there is only one $q$:
fig 61
Then we have from (1):
$$q = \sum_{j=1}^{n} w_j y_j = \sum_{j=1}^{n} w_j H(V_j \cdot x + \theta_j). \qquad (2)$$
Question: Can any function $f(x): \mathbb{R}^k \to \mathbb{R}$ be approximately represented in this form?
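(As a concrete sketch of form (2): a minimal evaluation routine, assuming a logistic choice of $H$; all names and the random numbers below are illustrative, not part of the notes.)

```python
import numpy as np

def H(t):
    """A sigmoid: bounded, monotone increasing (here the logistic function)."""
    return 1.0 / (1.0 + np.exp(-t))

def network(x, V, theta, w):
    """Evaluate q = sum_j w_j * H(V_j . x + theta_j), i.e., form (2).

    x: (k,) input activations; V: (n, k) rows are the vectors V_j;
    theta: (n,) constants theta_j; w: (n,) output weights w_j.
    """
    return w @ H(V @ x + theta)

# Hypothetical example with k = 3 inputs and n = 4 hidden neurons.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
V = rng.normal(size=(4, 3))
theta = rng.normal(size=4)
w = rng.normal(size=4)
print(network(x, V, theta, w))
```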
Partial Answer: Hilbert's 13th problem.
1957: Kolmogorov solved the 13th problem by proving that any continuous function $f: \mathbb{R}^k \to \mathbb{R}$ can be represented in the form
$$f(x) = \sum_{j=1}^{2k+1} q_j\!\left( \sum_{i=1}^{k} \psi_{ij}(x_i) \right),$$
where $q_j, \psi_{ij}$ are continuous functions, and the $\psi_{ij}$ are monotone and independent of $f$.
That is, $f$ can be represented as a sum of functions, each of which depends just on a sum of single-variable functions.
3. Some results
1987: Hecht-Nielsen remarks that if we have 4 layers
fig 62
any function $f(x)$ can be approximated within $\varepsilon$ in $C(\mathbb{R}^d)$ norm by such a network.
Caveat: we don't know how many neurons it will take in the middle.
1989: Funahashi proved:
Theorem: Let $H(x)$ be a non-constant, bounded, and monotone increasing function. Let $K$ be a compact (closed and bounded) subset of $\mathbb{R}^k$, and let $f(x)$ be a real-valued continuous function on $K$:
fig 63
Then for arbitrary $\varepsilon > 0$, there exist real constants $w_j, \theta_j$ and vectors $V_j$ such that
$$\tilde f(x) = \sum_{j=1}^{n} w_j H(V_j \cdot x + \theta_j) \qquad (3)$$
satisfies
$$\|f(x) - \tilde f(x)\|_\infty \leq \varepsilon.$$
Thus functions of the form (3) are dense in the Banach space $C(K)$, equipped with the $\|\cdot\|_\infty$ norm.
Corollary: Functions of the form (3) are dense in $L^p(K)$ for all $p$, $1 \leq p < \infty$.
That is, given any such $p$, an i-o (input-output) function $f(x) \in L^p(K)$, and $\varepsilon > 0$, there exists an $\tilde f$ of the form (3) such that $\|f - \tilde f\|_p \leq \varepsilon$.
Caveat: may need very large hidden layer (middle layer).
Important question: How large will the hidden layer need to be to get an approximation within $\varepsilon$ (i.e., how complex is it to build a neural network which recognizes a chair)?
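(A minimal numerical sketch of Funahashi's statement in one dimension; the target $f$, the random choice of the $V_j, \theta_j$, and the least-squares fit of the $w_j$ are illustrative assumptions, not Funahashi's construction. Increasing the hidden-layer size $n$ drives the sup error on $K$ down, but how large $n$ must be for a given $\varepsilon$ is exactly the question above.)

```python
import numpy as np

def H(t):
    # Sigmoid: non-constant, bounded, monotone increasing.
    return 1.0 / (1.0 + np.exp(-t))

# Target i-o function on the compact set K = [-1, 1].
f = lambda x: np.sin(3 * x) + 0.5 * x**2
x = np.linspace(-1.0, 1.0, 400)

rng = np.random.default_rng(1)
for n in (5, 20, 80):
    V = rng.normal(scale=5.0, size=n)            # random V_j
    theta = rng.uniform(-5.0, 5.0, size=n)       # random theta_j
    Y = H(np.outer(x, V) + theta)                # hidden activations H(V_j x + theta_j)
    w, *_ = np.linalg.lstsq(Y, f(x), rcond=None) # fit the output weights w_j
    err = np.max(np.abs(Y @ w - f(x)))           # sup-norm error on the grid
    print(n, err)
```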
4. Newer activation functions:
Recall $H(x)$ is assumed to be a sigmoid function:
fig 64
Reason: biological plausibility.
Newer idea: how about a localized $H(x)$?
fig 65
Not as biologically plausible, but it may work better. E.g., $H$ could be a wavelet.
Poggio, Girosi, and others pointed out: if $H(x) = \cos x$ on $[0, \infty)$, we get
$$f(x) = \sum_{j=1}^{n} w_j \cos(V_j \cdot x + \theta_j). \qquad (4)$$
Now choose $V_j = m = (m_1, m_2, m_3, \ldots)$, where the $m_i$ are nonnegative integers, and $\theta_j = 0$. Then
$$f(x) = \sum_{m} w_m \cos(m \cdot x).$$
Now if
$$K = \{(x_1, x_2, \ldots, x_k) \mid -\pi \leq x_i \leq \pi \text{ for } i = 2, 3, \ldots \text{ and } 0 \leq x_1 \leq \pi\},$$
then this is just a multivariate Fourier cosine series in $x$. We know that continuous functions can be approximated by multivariate Fourier series, and we know how to find the $w_m$ very easily:
$$w_m = \frac{2}{\pi^k} \int f(x) \cos(m \cdot x)\, dx.$$
We can build the network immediately, since we know what the weights need to be if we know the i-o function. Very powerful.
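(A minimal sketch in one variable, $k = 1$; the particular $f$ and the quadrature grid are illustrative assumptions. The weights $w_m$ come straight from the integral above, with no training; the constant term is given half weight, as usual for a cosine series.)

```python
import numpy as np

f = lambda x: x * (np.pi - x)          # hypothetical i-o function on [0, pi]
x = np.linspace(0.0, np.pi, 20001)
dx = x[1] - x[0]

M = 20
w = np.zeros(M + 1)
for m in range(M + 1):
    # w_m = (2/pi) * integral of f(x) cos(m x) dx, by Riemann sum
    w[m] = (2.0 / np.pi) * np.sum(f(x) * np.cos(m * x)) * dx
w[0] *= 0.5                            # constant term carries half weight

approx = sum(w[m] * np.cos(m * x) for m in range(M + 1))
print(np.max(np.abs(approx - f(x))))   # sup error of the cosine-series "network"
```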
Notice that $H(x)$ here is:
fig. 66
nothing like a sigmoid.
Note: there are questions of stability, however: make a small mistake in $x$, and $\cos(m \cdot x)$ may vary wildly. However, in machine tasks this may not be as crucial.
Why does this not solve the approximation problem once and for
all? It ignores learning. The learning problem is better solved by:
5. Radial basis functions
Had:
$$q = \sum_{j=1}^{n} w_j y_j = \sum_{j=1}^{n} w_j H(V_j \cdot x + \theta_j) = f(x).$$
Now consider newer families of activation functions and neural network protocols: instead of each neuron in the hidden layer summing its inputs, perhaps it can take more complicated functions of its inputs:
Assume now that $K$ is a fixed function:
fig 67
and assume (for some fixed choice of $z_i, \sigma$):
$$y_1 = y_1(x) = K\!\left( \frac{x - z_1}{\sigma} \right), \qquad y_2 = y_2(x) = K\!\left( \frac{x - z_2}{\sigma} \right),$$
and in general
$$y_i = y_i(x) = K\!\left( \frac{x - z_i}{\sigma} \right).$$
Now write (again)
$$q = \sum_i w_i y_i(x) = \sum_i w_i K\!\left( \frac{x - z_i}{\sigma} \right).$$
The goal now is to represent the i-o function $f$ as a sum of bump functions:
fig 68
Functions $K$ of this kind are called radial basis functions. Both neural network and approximation theory people are interested in these.
Idea behind radial basis functions:
Each neuron $y_i$ in the hidden layer has a given activation function $y_i(x)$ which depends on the activations $x = (x_1, \ldots, x_k)$ in the first layer.
Weights $w_i$ connect the middle layer to the output layer (a single neuron):
fig 69
Output is:
$$q = \sum_{i=1}^{n} w_i y_i(x)$$
(this should be a good approximation to the desired i-o function, $q = f(x)$).
Can check: the best choice of weights $w_i$ is made by choosing $w_i$ large if there is a large “overlap” between the desired i-o function $f(x)$ and the given function $y_i(x) = K\!\left( \frac{x - z_i}{\sigma} \right)$ (i.e., $y_i(x)$ large where $f(x)$ is large):
fig 70
Thus $w_i$ measures the “overlap” between $f(x)$ and the activation function $y_i(x)$.
Usually there is one neuron $y_i$ which has the highest overlap $w_i$; in adaptive resonance theory, this neuron is the “winner” and all other neurons get weight $0$.
Here each neuron provides a weight $w_i$ according to the “degree of matching” of that neuron with the desired i-o function $f(x)$.
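(A minimal sketch of this “overlap” rule; the Gaussian $K$, the grid, and the normalization of each weight by $\langle y_i, y_i \rangle$ are illustrative assumptions. Each $w_i$ is taken proportional to the inner product of $f$ with the bump $y_i$.)

```python
import numpy as np

K = lambda t: np.exp(-t**2)                     # a Gaussian bump as the fixed function K
f = lambda x: np.maximum(0.0, 1.0 - np.abs(x))  # desired i-o function (a "tent")

x = np.linspace(-3.0, 3.0, 6001)
dx = x[1] - x[0]
z = np.linspace(-2.0, 2.0, 9)                   # centers z_i
sigma = 0.5

Y = K((x[:, None] - z[None, :]) / sigma)        # columns are y_i(x) = K((x - z_i)/sigma)
# "Overlap" weights: w_i proportional to <f, y_i>, normalized by <y_i, y_i>.
w = (Y.T @ f(x)) * dx / ((Y * Y).sum(axis=0) * dx)
q = Y @ w                                       # q(x) = sum_i w_i y_i(x)
print(np.sqrt(np.sum((q - f(x))**2) * dx))      # L^2 error of the overlap fit
```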
Poggio's “A theory of how the brain might work” (1990) gives plausible arguments that something like this “matching” of the desired i-o function against bumps like $y_i(x)$ may be at work in the brain (facial recognition; motor tasks in the cerebellum).
6. Mathematical analysis of RBF networks:
Mathematically, the class of functions we obtain has the form
$$q(x) = \sum_{i=1}^{n} w_i K\!\left( \frac{x - z_i}{\sigma} \right), \qquad (5)$$
where $K$ is a fixed function and $\{z_i, \sigma\}$ are constants which may vary.
Note: the class of functions $q(x)$ of this form will be called $S_0(K)$.
Again: which functions $f(x)$ can be approximated by functions of this form?
Park and Sandberg (1993) answered this question (other versions appeared previously):
Theorem (Park and Sandberg, 1993): Assuming $K$ is integrable, $S_0$ is dense in $L^1(\mathbb{R}^k)$ if and only if $\int K \neq 0$.
That is, any i-o function $f(x)$ in $L^1$ (i.e., an integrable function) can be approximated to an arbitrary degree of accuracy by a function of the form (5) in $L^1$ norm.
Proof: Assume that $\int K \neq 0$.
Let $C_c$ = the continuous, compactly supported functions on $\mathbb{R}^k$.
Then any $L^1$ function can be approximated arbitrarily well in $L^1$ by functions in $C_c$, i.e., $C_c$ is dense in $L^1$.
Thus to show that $L^1$ functions can be arbitrarily well approximated in $L^1$ norm by functions in $S_0$, it is sufficient to show that functions in $C_c$ can be well approximated in $L^1$ by functions in $S_0$.
Choose $\varepsilon_1 > 0$ and a function $K_c \in C_c$ such that
$$\|K - K_c\|_1 < \varepsilon_1.$$
Let the constant $a = \dfrac{1}{\int K_c(x)\,dx}$. Define $\varphi(x) = a\, K_c(x)$, so that
$$\int \varphi(x)\,dx = a \int K_c(x)\,dx = 1.$$
Define $\varphi_\sigma(x) = \dfrac{1}{\sigma^k}\, \varphi(x/\sigma)$.
Basic Lemma: Let $f_c \in C_c$. Then
$$\|f_c - \varphi_\sigma * f_c\|_1 \to 0 \quad \text{as } \sigma \to 0.$$
(Here $*$ denotes convolution.)
Thus functions of the form $f_c$ can be arbitrarily well approximated by $\varphi_\sigma * f_c$; it is therefore sufficient to show that $\varphi_\sigma * f_c$ can be approximated arbitrarily well by functions in $S_0$.
Now write (for $T$ below sufficiently large):
$$(\varphi_\sigma * f_c)(\alpha) = \int_{[-T,T]^k} \varphi_\sigma(\alpha - x)\, f_c(x)\, dx \approx e_n(\alpha) \equiv \sum_{i=1}^{n^k} \varphi_\sigma(\alpha - \alpha_i)\, f_c(\alpha_i) \left( \frac{2T}{n} \right)^{k}, \qquad (6)$$
where the $\alpha_i$ are points of the form
$$\left( -T + \frac{2 i_1 T}{n},\; -T + \frac{2 i_2 T}{n},\; \ldots,\; -T + \frac{2 i_k T}{n} \right)$$
[one point in each sub-cube of side $2T/n$].
The Riemann sum implies that $e_n \to \varphi_\sigma * f_c$ pointwise as $n \to \infty$; then one can use the dominated convergence theorem to show that the convergence is also in $L^1$.
Thus we can approximate $\varphi_\sigma * f_c$ by $e_n$. Now we need to show that $e_n$ can be approximated by something in $S_0$. By (6),
$$e_n(\alpha) = \frac{1}{n^k} \sum_{i} \frac{f_c(\alpha_i)\,(2T)^k}{\sigma^k \int K_c(\alpha)\, d\alpha}\; K_c\!\left( \frac{\alpha - \alpha_i}{\sigma} \right).$$
Now replace $K_c$ by $K$ (recall $\|K - K_c\|_1 < \varepsilon_1$, so the two can be made arbitrarily close); then we have something in $S_0$.
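(A minimal numerical sketch of this construction with $k = 1$; the particular $K_c$, $f_c$, midpoint sample points, and parameter values are illustrative assumptions. The Riemann sum $e_n$ is literally an element of $S_0(K_c)$, and its $L^1$ distance to $f_c$ shrinks as $\sigma \to 0$ and $n \to \infty$.)

```python
import numpy as np

# Hypothetical choices: K_c a triangular bump, f_c a compactly supported cosine bump.
K_c = lambda t: np.maximum(0.0, 1.0 - np.abs(t))
f_c = lambda t: np.where(np.abs(t) < 1.0, np.cos(np.pi * t / 2.0), 0.0)

x = np.linspace(-4.0, 4.0, 16001)
dx = x[1] - x[0]
int_Kc = np.sum(K_c(x)) * dx                         # integral of K_c (= 1 for this triangle)

T, sigma, n = 2.0, 0.1, 400
alpha = -T + (2 * np.arange(1, n + 1) - 1) * T / n   # one point per sub-interval of size 2T/n

# e_n(x) = sum_i [ f_c(alpha_i) (2T) / (n * sigma * int K_c) ] * K_c((x - alpha_i)/sigma)
w = f_c(alpha) * (2 * T) / (n * sigma * int_Kc)
e_n = (w[None, :] * K_c((x[:, None] - alpha[None, :]) / sigma)).sum(axis=1)

print(np.sum(np.abs(e_n - f_c(x))) * dx)             # L^1 error ||e_n - f_c||_1
```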
The converse of the theorem (the “only if” direction) is easy.
Second theorem, for $L^2$ density: Define
$$S_1 = \left\{ \sum_i a_i K\!\left( \frac{x - z_i}{\sigma_i} \right) \;\middle|\; a_i, \sigma_i \in \mathbb{R};\; z_i \in \mathbb{R}^d \right\}$$
(a variable scale $\sigma_i$, which can depend on $i$, has been added).
Theorem: Assuming that $K$ is square integrable, $S_1(K)$ is dense in $L^2(\mathbb{R}^d)$ iff $K$ is non-zero a.e.
Theorem: Assume that $K$ is integrable and continuous, and that $K^{-1}(0)$ does not contain any set of the form $\{ t w : t \geq 0 \}$ for any vector $w$. Then $S_1$ is dense in $C(W)$ with respect to the sup norm, for any compact set $W$.
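(A minimal sketch of the variable-scale family $S_1$; the Gaussian $K$, the centers, the two scale groups, and the least-squares fit are illustrative assumptions. Mixing scales $\sigma_i$ lets wide bumps carry the smooth part of $f$ and narrow ones the sharp part.)

```python
import numpy as np

K = lambda t: np.exp(-t**2)                                     # square-integrable kernel
f = lambda x: np.exp(-x**2) + 0.3 * np.exp(-50 * (x - 1.0)**2)  # smooth part + sharp feature

x = np.linspace(-4.0, 4.0, 4001)
dx = x[1] - x[0]

# Elements of S_1: sum_i a_i K((x - z_i)/sigma_i) with a per-term scale sigma_i.
z = np.tile(np.linspace(-3.0, 3.0, 25), 2)
sigma = np.concatenate([np.full(25, 1.0), np.full(25, 0.1)])    # two scale groups
Phi = K((x[:, None] - z[None, :]) / sigma[None, :])

a, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)                  # least-squares choice of a_i
print(np.sqrt(np.sum((Phi @ a - f(x))**2) * dx))                # L^2 error on [-4, 4]
```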