Neural Networks: Preliminaries

Notation: Write $x = (x_1, x_2, \ldots, x_d)$, and in integrals $dx = dx_1\, dx_2 \cdots dx_d$.

Recall:

Definition: $L^p$ norm: $\|f(x)\|_p = \left(\int dx\, |f(x)|^p\right)^{1/p}$.

$C(\mathbb{R}^d)$ = continuous functions on $\mathbb{R}^d$. Define a norm on $C(\mathbb{R}^d)$ by $\|f(x)\|_\infty = \sup_x |f(x)|$.

A set of functions $S$ is dense in the $L^p$ norm if every $f \in L^p$ can be arbitrarily well approximated by functions in $S$.

A set of functions $S \subset C(\mathbb{R}^d)$ is dense in the $C(\mathbb{R}^d)$ norm if for every $f \in C(\mathbb{R}^d)$ there is a sequence of functions $f_i \in S$ such that $\|f_i - f\|_\infty \to 0$ as $i \to \infty$.

A function $f(x)$ defined on $\mathbb{R}^d$ is compactly supported if the set of $x \in \mathbb{R}^d$ for which $f(x) \neq 0$ is bounded (i.e., satisfies $\|x\|_2 = \left(\sum_i x_i^2\right)^{1/2} \leq M$ for some fixed $M > 0$).

A function $f(x)$ defined for $x \in \mathbb{R}$ is monotone increasing if $f(x)$ does not decrease as $x$ gets larger.

Neural Networks and Radial Basis Functions

1. Neural network theory

1. Since artificial intelligence (using von Neumann processors) has failed to produce true intelligence, we wish to work toward computational solutions to problems in intelligence.

2. Neural network theory has held that promise. Existence proof: neural nets work in people, insects, etc. Applications of neural nets: engineering, mathematics, physics, biology, psychology.

3. A basic component of many neural nets: feed-forward neural networks.

The role of learning theory has grown a great deal in:

• Mathematics
• Statistics
• Computational biology
• Neurosciences, e.g., theory of plasticity, workings of the visual cortex

[Image: cortex. Source: University of Washington, http://sig.biostr.washington.edu/projects/brain/images/3.16.21.gif]

• Computer science, e.g., vision theory, graphics, speech synthesis

[Image: face identification. Source: T. Poggio/MIT]

[Image: people classification or detection. Source: T. Poggio/MIT]

What is the theory behind such learning algorithms?

2. The problem: learning theory

Given an unknown function $f(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}$, learn $f(\mathbf{x})$.

Example 1: $\mathbf{x}$ is a retinal activation pattern (i.e., $x_i$ = activation level of retinal neuron $i$), and $y = f(\mathbf{x}) > 0$ if the retinal pattern is a chair; $y = f(\mathbf{x}) < 0$ otherwise. [Thus: we want the concept of a chair.]

Given: examples of chairs (and non-chairs) $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, together with the proper outputs $y_1, \ldots, y_n$. This is the information

$$N f = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)).$$

Goal: give the best possible estimate of the unknown function $f$, i.e., try to learn the concept $f$ from the examples $N f$.

But: pointwise information about $f$ is not sufficient: which is the "right" $f(x)$ given the data $N f$ below?

(a) [fig]

(b) [fig]

[How to decide?]

3. The neural network approach to learning

Feed-forward neural network: fig. 58

Layer = vertical row of neurons. Neurons in the first layer influence neurons in the second layer. Neurons in the second layer influence neurons in the third layer. Etc.

The first layer contains the "input", i.e., we control the activations of its neurons. The last layer contains the "output", i.e., its activations provide the desired output that the neural network produces in response to the input in the first layer.

Funahashi and Hecht-Nielsen have shown that if we desire a network which is able to take an arbitrary input pattern in the first layer and provide an arbitrary desired output pattern in the last layer, all that is necessary is 3 layers: fig. 59

Now consider only 3-layer networks.

$x_i$ = activation level (either chemical or electrical potential) of the $i$-th neuron in the first layer
$y_i$ = activation level of the $i$-th neuron in the second layer
$q_i$ = activation level of the $i$-th neuron in the third layer
$v_{ij}$ = strength of connection (weight) from the $j$-th neuron in layer 1 to the $i$-th neuron in layer 2
$w_{ij}$ = strength of connection (weight) from the $j$-th neuron in layer 2 to the $i$-th neuron in layer 3
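As a concrete anchor for this notation, the following is a minimal sketch (not from the notes themselves; it assumes NumPy and an illustrative logistic choice of the activation function $H$, which is introduced in the neuron interaction rule below) of how the three layers and the two sets of weights fit together in code.

```python
import numpy as np

def H(t):
    # Illustrative sigmoid: the logistic function (an assumption, not fixed by the notes).
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, V, theta, W):
    """Three-layer feed-forward pass in the notation above.

    x     : (k,)   activations of the first (input) layer
    V     : (n, k) weights v_ij from layer-1 neuron j to layer-2 neuron i
    theta : (n,)   constants theta_i for the hidden layer
    W     : (m, n) weights w_ij from layer-2 neuron j to layer-3 neuron i
    """
    y = H(V @ x + theta)   # hidden activations: y_i = H(sum_j v_ij x_j + theta_i)
    q = W @ y              # output activations: q_i = sum_j w_ij y_j
    return q

# Tiny example: k = 4 inputs, n = 3 hidden neurons, m = 2 outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
q = forward(x, rng.normal(size=(3, 4)), rng.normal(size=3), rng.normal(size=(2, 3)))
print(q)
```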
Example: the first layer is the retina, and $x_i$ is the illumination level at neuron $i$. This is the input layer (light shines on the retina and activates it). The last layer is the speech center (neurons ultimately connected to the mouth), and its pattern $q_i$ of neuron activations corresponds to the verbal description, about to be delivered, of what is seen in the first layer.

2. Neuron interaction rule

Neurons in one layer influence those in the next layer in an almost linear way:

$$y_i = H\left(\sum_{j=1}^{k} v_{ij} x_j + \theta_i\right),$$

i.e., the activation $y_i$ is a linear function of the activations $x_j$ in the previous layer, aside from the function $H$. Here $\theta_i$ = constant for each $i$.

The function $H$ is a sigmoid: fig 60

Note that $H$ has an upper bound, so the response cannot exceed some constant.

Activation in the third layer:

$$q_i = \sum_{j=1}^{n} w_{ij} y_j$$

= linear function of the $y_j$'s.

Goal: show we can get an arbitrary desired output pattern $q_i$ of activations on the last layer as a function of the inputs $x_i$ in the first layer.

Vector notation:

$x = (x_1, x_2, \ldots, x_k)$ = vector of neuron activations in layer 1,

$V_i = (v_{i1}, v_{i2}, \ldots, v_{ik})$ = vector of connection weights from the neurons in the first layer to the $i$-th neuron in the second layer.

Now: the activation $y_i$ in the second layer is

$$y_i = H\left(\sum_{j=1}^{k} v_{ij} x_j + \theta_i\right) = H(V_i \cdot x + \theta_i).$$

The activation $q_i$ in the third layer is

$$q_i = \sum_{j=1}^{n} w_{ij} y_j = W_i \cdot y.$$

Goal: to show that the pattern of activations $q = (q_1, q_2, \ldots, q_k)$ on the last layer (output) can be made an arbitrary function of the pattern of activations $x = (x_1, x_2, \ldots, x_k)$ on the first layer.

Notice: the activation of the $i$-th neuron in layer 3 is

$$q_i = \sum_{j=1}^{n} w_{ij} y_j = \sum_{j=1}^{n} w_{ij} H(V_j \cdot x + \theta_j). \qquad (1)$$

Thus the question is: if $q = f(x)$ is defined by (1) (i.e., the input determines the output through a neural network equation), is it possible to approximate any function in this form?

Ex. If the first layer = retina, then we can require that if $x$ = the visual image of a chair (the vector of pixel intensities corresponding to a chair), then $q$ = the neural pattern of intensities corresponding to articulation of the words "this is a chair".

Equivalently: given any function $f(x): \mathbb{R}^k \to \mathbb{R}^k$, can we approximate $f(x)$ by a function of the form (1) in (a) the $C(\mathbb{R}^k)$ norm? (b) the $L^1$ norm? (c) the $L^2$ norm?

One can show that these questions are equivalent to the case where there is only one $q$: fig 61

Then we have from (1):

$$q = \sum_{j=1}^{n} w_j y_j = \sum_{j=1}^{n} w_j H(V_j \cdot x + \theta_j). \qquad (2)$$

Question: Can any function $f(x): \mathbb{R}^k \to \mathbb{R}$ be approximately represented in this form?

Partial answer: Hilbert's 13th problem. 1957: Kolmogorov solved the 13th problem by proving that any continuous function $f: \mathbb{R}^k \to \mathbb{R}$ can be represented in the form

$$f(x) = \sum_{j=1}^{2k+1} q_j\left(\sum_{i=1}^{k} \psi_{ij}(x_i)\right),$$

where $q_j, \psi_{ij}$ are continuous functions, and the $\psi_{ij}$ are monotone and independent of $f$. That is, $f$ can be represented as a sum of functions, each of which depends just on a sum of single-variable functions.

3. Some results

1987: Hecht-Nielsen remarks that if we have 4 layers (fig 62), any function $f(x)$ can be approximated within $\varepsilon$ in the $C(\mathbb{R}^d)$ norm by such a network. Caveat: we don't know how many neurons it will take in the middle.
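Before stating Funahashi's theorem, here is a small numerical sketch of the single-output form (2) (not part of the notes; it assumes NumPy, a logistic $H$, randomly drawn inner parameters $V_j, \theta_j$, and outer weights $w_j$ fitted by least squares). The sup-norm error of the fit to a smooth one-dimensional target typically shrinks as the hidden layer grows.

```python
import numpy as np

def H(t):
    # Illustrative sigmoid choice for H (logistic function).
    return 1.0 / (1.0 + np.exp(-t))

f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x**2   # target i-o function on [0, 1]
x = np.linspace(0.0, 1.0, 400)

rng = np.random.default_rng(1)
for n in (5, 20, 80):                              # hidden-layer sizes
    V = rng.normal(scale=10.0, size=n)             # random inner weights V_j (scalars here)
    theta = rng.uniform(-10.0, 10.0, size=n)       # random constants theta_j
    Phi = H(np.outer(x, V) + theta)                # Phi[:, j] = H(V_j x + theta_j)
    w, *_ = np.linalg.lstsq(Phi, f(x), rcond=None) # outer weights w_j by least squares
    err = np.max(np.abs(Phi @ w - f(x)))           # sup-norm error on the grid
    print(f"hidden neurons n = {n:3d}   sup error ~ {err:.4f}")
```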
1989: Funahashi proved:

Theorem: Let $H(x)$ be a non-constant, bounded, and monotone increasing function. Let $K$ be a compact (closed and bounded) subset of $\mathbb{R}^k$, and let $f(x)$ be a real-valued continuous function on $K$: fig 63

Then for arbitrary $\varepsilon > 0$, there exist real constants $w_j, \theta_j$ and vectors $V_j$ such that

$$\tilde{f}(x) = \sum_{j=1}^{n} w_j H(V_j \cdot x + \theta_j) \qquad (3)$$

satisfies $\|\tilde{f}(x) - f(x)\|_\infty \leq \varepsilon$.

Thus functions of the form (3) are dense in the Banach space $C(K)$, defined with the $\|\cdot\|_\infty$ norm.

Corollary: Functions of the form (3) are dense in $L^p(K)$ for all $p$, $1 \leq p < \infty$. That is, given any such $p$, an i-o (input-output) function $f(x) \in L^p(K)$, and $\varepsilon > 0$, there exists an $\tilde{f}$ of the form (3) such that $\|\tilde{f} - f\|_p \leq \varepsilon$.

Caveat: we may need a very large hidden layer (middle layer).

Important question: How large will the hidden layer need to be to get an approximation within $\varepsilon$ (i.e., how complex is it to build a neural network which recognizes a chair)?

4. Newer activation functions

Recall that $H(x)$ is assumed to be a sigmoid function: fig 64

Reason: biological plausibility.

Newer idea: how about a localized $H(x)$? fig 65

Not as biologically plausible, but it may work better. E.g., $H$ could be a wavelet?

Poggio, Girosi, and others pointed out: if $H(x) = \cos x$ on $[0, \infty)$, we get

$$f(x) = \sum_{j=1}^{n} w_j \cos(V_j \cdot x + \theta_j). \qquad (4)$$

Now choose $V_j = m = (m_1, m_2, m_3, \ldots)$, where the $m_i$ are nonnegative integers, and $\theta_j = 0$. Then

$$f(x) = \sum_{m} w_m \cos(m \cdot x).$$

Now if $K = \{(x_1, x_2, \ldots, x_k) \mid -\pi \leq x_i \leq \pi \text{ for } i = 2, 3, \ldots \text{ and } 0 \leq x_1 \leq \pi\}$, then this is just a multivariate Fourier cosine series in $x$. We know that continuous functions can be approximated by multivariate Fourier series, and we know how to find the $w_m$ very easily:

$$w_m = \left(\frac{2}{\pi}\right)^{k} \int f(x) \cos(m \cdot x)\, dx.$$

We can build the network immediately, since we know what the weights need to be if we know the i-o function. Very powerful.

Notice that $H(x)$ here (fig. 66) is nothing like a sigmoid.

Note: there are questions of stability, however: make a small mistake in $x$, and $\cos(m \cdot x)$ may vary wildly. However, in machine tasks this may not be as crucial.

Why does this not solve the approximation problem once and for all? It ignores learning. The learning problem is better solved by:

5. Radial basis functions

We had:

$$q = \sum_{j=1}^{n} w_j y_j = \sum_{j=1}^{n} w_j H(V_j \cdot x + \theta_j) = f(x).$$

Now consider newer families of activation functions and neural network protocols: instead of each neuron in the hidden layer summing its inputs, perhaps it can take more complicated functions of its inputs.

Assume now that $K$ is a fixed function (fig 67), and assume (for some fixed choice of $z_i, \sigma$):

$$y_1 = y_1(x) = K\left(\frac{x - z_1}{\sigma}\right), \quad y_2 = y_2(x) = K\left(\frac{x - z_2}{\sigma}\right),$$

and in general

$$y_i = y_i(x) = K\left(\frac{x - z_i}{\sigma}\right).$$

Now write (again)

$$q = \sum_i w_i y_i(x) = \sum_i w_i K\left(\frac{x - z_i}{\sigma}\right).$$

The goal now is to represent the i-o function $f$ as a sum of bump functions: fig 68

The functions $K$ are called radial basis functions. Neural network and approximation people are interested in these.

Idea behind radial basis functions: each neuron $y_i$ in the hidden layer has a given activation function $y_i(x)$ which depends on the activations $x = (x_1, \ldots, x_k)$ in the first layer. Weights $w_i$ connect the middle layer to the output layer (a single neuron): fig 69

The output is

$$q = \sum_{i=1}^{n} w_i y_i(x)$$

(this should be $q \approx f(x)$, a good approximation to the desired i-o function).
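A minimal sketch of this setup (not from the notes; it assumes NumPy, a Gaussian choice of the bump $K$, equally spaced centers $z_i$, a fixed scale $\sigma$, and weights $w_i$ chosen by least squares):

```python
import numpy as np

K = lambda u: np.exp(-u**2)               # illustrative Gaussian bump for K
f = lambda x: np.sin(3 * x) * np.exp(-x)  # target i-o function on [0, 4]

x = np.linspace(0.0, 4.0, 500)
z = np.linspace(0.0, 4.0, 25)             # centers z_i
sigma = 0.3                               # fixed scale sigma

Phi = K((x[:, None] - z[None, :]) / sigma)      # Phi[:, i] = K((x - z_i)/sigma) = y_i(x)
w, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)  # weights w_i chosen by least squares
q = Phi @ w                                     # q(x) = sum_i w_i y_i(x)
print("mean absolute error on the grid:", np.mean(np.abs(q - f(x))))
```

With a localized bump like the Gaussian, each fitted $w_i$ mainly reflects the behavior of $f$ near the center $z_i$, which is the "matching" picture discussed next.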
One can check: the best choice of weights $w_i$ is to take $w_i$ large if there is a large "overlap" between the desired i-o function $f(x)$ and the given function $y_i(x) = K\left(\frac{x - z_i}{\sigma}\right)$ (i.e., $y_i(x)$ is large where $f(x)$ is large): fig 70

Thus $w_i$ measures the "overlap" between $f(x)$ and the activation function $y_i(x)$.

Usually there is one neuron $y_i$ which has the highest overlap $w_i$; in adaptive resonance theory, this neuron is the "winner" and all other neurons have weight 0. Here, instead, each neuron provides a weight $w_i$ according to the "degree of matching" of the neuron with the desired i-o function $f(x)$.

Poggio, "A theory of how the brain might work" (1990), gives plausible arguments that something like this "matching" of the desired i-o function against bumps like $y_i(x)$ may be at work in the brain (facial recognition; motor tasks in the cerebellum).

6. Mathematical analysis of RBF networks

Mathematically, the class of functions we obtain has the form

$$q(x) = \sum_{i=1}^{n} w_i K\left(\frac{x - z_i}{\sigma}\right), \qquad (5)$$

where $K$ is a fixed function and $\{z_i, \sigma\}$ are constants which may vary. Note: the class of functions $q(x)$ of this form will be called $S_0(K)$.

Again: which functions $f(x)$ can be approximated by functions of this form? Park and Sandberg (1993) answered this question (other versions appeared previously):

Theorem (Park and Sandberg, 1993): Assuming $K$ is integrable, $S_0$ is dense in $L^1(\mathbb{R}^k)$ if and only if $\int K \neq 0$.

That is, any i-o function $f(x)$ in $L^1$ (i.e., an integrable function) can be approximated to an arbitrary degree of accuracy by a function of the form (5) in the $L^1$ norm.

Proof: Assume that $\int K \neq 0$. Let $\mathcal{C}_c$ = continuous compactly supported functions on $\mathbb{R}^k$. Then any $L^1$ function can be approximated arbitrarily well in $L^1$ by functions in $\mathcal{C}_c$, i.e., $\mathcal{C}_c$ is dense in $L^1$. Thus to show that $L^1$ functions can be arbitrarily well approximated in the $L^1$ norm by functions in $S_0$, it is sufficient to show that functions in $\mathcal{C}_c$ can be well approximated in $L^1$ by functions in $S_0$.

Choose $\varepsilon_1 > 0$ and a function $K_c \in \mathcal{C}_c$ such that $\|K - K_c\|_1 < \varepsilon_1$. Let the constant $a = 1/\int K_c(x)\, dx$. Define $\varphi(x) = a K_c(x)$, so that

$$\int \varphi(x)\, dx = a \int K_c(x)\, dx = 1.$$

Define $\varphi_\sigma(x) = \frac{1}{\sigma^k}\, \varphi(x/\sigma)$.

Basic Lemma: Let $f_c \in \mathcal{C}_c$. Then $\|f_c - \varphi_\sigma * f_c\|_1 \to 0$ as $\sigma \to 0$. (Here $*$ denotes convolution.)

Thus functions of the form $f_c$ can be arbitrarily well approximated by $\varphi_\sigma * f_c$; therefore it is sufficient to show that $\varphi_\sigma * f_c$ can be approximated arbitrarily well by functions in $S_0$.

Now write (for $T$ below sufficiently large):

$$(\varphi_\sigma * f_c)(\alpha) = \int_{[-T,T]^k} \varphi_\sigma(\alpha - x) f_c(x)\, dx \;\approx\; e_n(\alpha) \equiv \sum_{i=1}^{n^k} \varphi_\sigma(\alpha - \alpha_i) f_c(\alpha_i) \left(\frac{2T}{n}\right)^{k}, \qquad (6)$$

where the $\alpha_i$ are points of the form $\left(-T + \frac{2 i_1 T}{n},\, -T + \frac{2 i_2 T}{n},\, \ldots,\, -T + \frac{2 i_k T}{n}\right)$ [one point in each sub-cube of side $2T/n$].

The Riemann sum implies that $e_n \to \varphi_\sigma * f_c$ pointwise as $n \to \infty$; we can then use the dominated convergence theorem to show that the convergence is also in $L^1$. Thus we can approximate $\varphi_\sigma * f_c$ by $e_n$.

Now we need to show that $e_n$ can be approximated by something in $S_0$. By (6),

$$e_n(\alpha) = \sum_{i=1}^{n^k} \frac{f_c(\alpha_i)(2T)^k}{n^k \sigma^k \int K_c(x)\, dx}\, K_c\left(\frac{\alpha - \alpha_i}{\sigma}\right).$$

Now replace $K_c$ by $K$, which can be made arbitrarily close; then we have something in $S_0$.

The converse of the theorem (the "only if" direction) is easy.

Second theorem, for $L^2$ density: Define

$$S_1 = \left\{\sum_i a_i K\left(\frac{x - z_i}{\sigma_i}\right) \;\Big|\; a_i, \sigma_i \in \mathbb{R};\; z_i \in \mathbb{R}^k\right\}$$

(a variable scale $\sigma_i$, which can depend on $i$, has been added).

Theorem: Assuming that $K$ is square integrable, $S_1(K)$ is dense in $L^2(\mathbb{R}^k)$ iff $K$ is non-zero, i.e., $K$ does not vanish almost everywhere.
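The construction in the proof above can be carried out numerically. The following sketch (not in the notes; it assumes NumPy, dimension $k = 1$, and illustrative choices of a compactly supported $f_c$ and an integrable kernel $K$ with $\int K \neq 0$) builds the mollification $\varphi_\sigma * f_c$ and the Riemann sum $e_n$ of (6), which is itself an element of $S_0$, and reports the two $L^1$-type errors.

```python
import numpy as np

# Dimension k = 1, with illustrative choices of f_c and K (both assumptions).
f_c = lambda x: np.maximum(1.0 - np.abs(x), 0.0)   # triangular bump, support [-1, 1]
K   = lambda x: np.exp(-x**2)                      # integrable kernel with nonzero integral

grid = np.linspace(-3.0, 3.0, 1501)                # evaluation grid for alpha
dx = grid[1] - grid[0]
a = 1.0 / (K(grid).sum() * dx)                     # a ~ 1 / int K  (the K_c truncation step is skipped)
phi_sigma = lambda x, s: (a / s) * K(x / s)        # phi_sigma(x) = a K(x/s) / s^k with k = 1

T, n, sigma = 2.0, 200, 0.05
alphas = -T + (2 * np.arange(n) + 1) * T / n       # one point in each sub-interval of [-T, T]

# Riemann sum e_n(alpha) = sum_i phi_sigma(alpha - alpha_i) f_c(alpha_i) (2T/n): an element of S_0.
e_n = (phi_sigma(grid[:, None] - alphas[None, :], sigma) * f_c(alphas)).sum(axis=1) * (2 * T / n)

# The mollification phi_sigma * f_c, computed by quadrature, for comparison.
conv = (phi_sigma(grid[:, None] - grid[None, :], sigma) * f_c(grid)).sum(axis=1) * dx

print("||f_c - phi_sigma*f_c||_1 ~", np.sum(np.abs(f_c(grid) - conv)) * dx)
print("||phi_sigma*f_c - e_n||_1 ~", np.sum(np.abs(conv - e_n)) * dx)
```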
Then W" is dense in GÐ[ Ñ with respect to the sup norm for any compact set [ . 282