Artificial Neural Networks (CS 8751 ML & KDD)

• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: face recognition
• Advanced topics

Connectionist Models

Consider humans:
• Neuron switching time: ~0.001 second
• Number of neurons: ~10^10
• Connections per neuron: ~10^4 to 10^5
• Scene recognition time: ~0.1 second
• 100 inference steps do not seem like enough -- must use lots of parallel computation!

Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically

When to Consider Neural Networks

• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of the result is unimportant

Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction

ALVINN Drives 70 mph on Highways

(Figure: a 30x32 sensor input retina feeds 4 hidden units, whose outputs drive steering commands such as Sharp Left, Straight Ahead, and Sharp Right.)

Perceptron

(Figure: inputs x_1, ..., x_n and a constant input x_0 = 1, each weighted by w_i, are summed and thresholded.)

o = 1 if \sum_{i=0}^{n} w_i x_i > 0, and -1 otherwise

Written out:

o(x_1, \ldots, x_n) = 1 if w_0 + w_1 x_1 + \cdots + w_n x_n > 0, and -1 otherwise

Sometimes we will use the simpler vector notation:

o(\vec{x}) = 1 if \vec{w} \cdot \vec{x} > 0, and -1 otherwise

Decision Surface of Perceptron

(Figure: two point sets in the (x_1, x_2) plane -- one separable by a line, one not.)

Represents some useful functions:
• What weights represent g(x_1, x_2) = AND(x_1, x_2)?

But some functions are not representable:
• e.g., those that are not linearly separable
• Therefore, we will want networks of these units ...

Perceptron Training Rule

w_i \leftarrow w_i + \Delta w_i, where \Delta w_i = \eta (t - o) x_i

and where:
• t = c(\vec{x}) is the target value
• o is the perceptron output
• \eta is a small constant (e.g., 0.1) called the learning rate

Can prove the rule will converge if the training data are linearly separable and \eta is sufficiently small.
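To make the rule concrete, here is a minimal Python/NumPy sketch of perceptron training on the AND function mentioned above; the function names, the +/-1 input encoding, and the learning rate of 0.1 are illustrative choices for this sketch, not something prescribed by the slides.

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit: +1 if w . x > 0, else -1 (x includes the bias input x_0 = 1)."""
    return 1.0 if np.dot(w, x) > 0 else -1.0

def train_perceptron(examples, eta=0.1, epochs=50):
    """Apply the perceptron training rule w_i <- w_i + eta * (t - o) * x_i.

    examples: list of (x, t) pairs, where x is a NumPy array with x[0] = 1
    (the bias input) and t is the target value, +1 or -1.
    """
    w = np.zeros(len(examples[0][0]))     # start from zero weights
    for _ in range(epochs):
        for x, t in examples:
            o = perceptron_output(w, x)
            w += eta * (t - o) * x        # no change when the example is already correct
    return w

# Learn AND(x1, x2) with a +/-1 encoding; x[0] is the constant bias input.
data = [(np.array([1.0, a, b]), 1.0 if a == 1.0 and b == 1.0 else -1.0)
        for a in (-1.0, 1.0) for b in (-1.0, 1.0)]
w = train_perceptron(data)
print(w, [perceptron_output(w, x) for x, _ in data])
```

Because this AND data is linearly separable, the updates stop changing the weights once every example is classified correctly, which is exactly the convergence behavior claimed above.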
Gradient Descent

To understand gradient descent, consider a simpler linear unit, where

o = w_0 + w_1 x_1 + \cdots + w_n x_n

Idea: learn the w_i that minimize the squared error

E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

where D is the set of training examples.

The gradient is the vector of partial derivatives:

\nabla E(\vec{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]

Training rule:

\Delta \vec{w} = -\eta \nabla E(\vec{w}), i.e., \Delta w_i = -\eta \frac{\partial E}{\partial w_i}

Computing the gradient for the linear unit:

\frac{\partial E}{\partial w_i}
= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2
= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2
= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)

so

\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d)(-x_{i,d})

GRADIENT-DESCENT(training_examples, \eta)

Each training example is a pair \langle \vec{x}, t \rangle, where \vec{x} is the vector of input values and t is the target output value; \eta is the learning rate (e.g., 0.05).

• Initialize each w_i to some small random value
• Until the termination condition is met, do
  – Initialize each \Delta w_i to zero
  – For each \langle \vec{x}, t \rangle in training_examples, do
    * Input the instance \vec{x} to the unit and compute the output o
    * For each linear unit weight w_i, do
      \Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i
  – For each linear unit weight w_i, do
    w_i \leftarrow w_i + \Delta w_i

Summary

The perceptron training rule is guaranteed to succeed if:
• The training examples are linearly separable
• The learning rate \eta is sufficiently small

The linear unit training rule uses gradient descent:
• Guaranteed to converge to the hypothesis with minimum squared error
• Given a sufficiently small learning rate \eta
• Even when the training data contain noise
• Even when the training data are not separable by H

Incremental (Stochastic) Gradient Descent

Batch mode gradient descent -- do until satisfied:
1. Compute the gradient \nabla E_D(\vec{w})
2. \vec{w} \leftarrow \vec{w} - \eta \nabla E_D(\vec{w})

Incremental mode gradient descent -- do until satisfied:
• For each training example d in D:
  1. Compute the gradient \nabla E_d(\vec{w})
  2. \vec{w} \leftarrow \vec{w} - \eta \nabla E_d(\vec{w})

where

E_D(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \qquad E_d(\vec{w}) = \frac{1}{2} (t_d - o_d)^2

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if \eta is made small enough.
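The two modes differ only in when the weights are updated. The NumPy sketch below spells this out for a single linear unit; the synthetic data, learning rate, and epoch count are illustrative assumptions, and both functions implement the update \Delta w_i = \eta (t - o) x_i derived above, summed over all of D in batch mode and applied per example in incremental mode.

```python
import numpy as np

def batch_gradient_descent(X, t, eta=0.02, epochs=200):
    """Batch mode: sum the update eta * (t_d - o_d) * x_d over all of D, then change w once."""
    w = np.random.uniform(-0.05, 0.05, X.shape[1])   # small random initial weights
    for _ in range(epochs):
        o = X @ w                       # linear unit outputs for every example in D
        w += eta * (t - o) @ X          # -eta * grad E_D, since dE_D/dw_i = -sum_d (t_d - o_d) x_{i,d}
    return w

def incremental_gradient_descent(X, t, eta=0.02, epochs=200):
    """Incremental (stochastic) mode: update the weights after every single example."""
    w = np.random.uniform(-0.05, 0.05, X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            w += eta * (t_d - x_d @ w) * x_d   # -eta * grad E_d for E_d = 1/2 (t_d - o_d)^2
    return w

# Illustrative data: x_0 = 1 plus two features, with a noisy linear target.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
t = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=20)
print(batch_gradient_descent(X, t))
print(incremental_gradient_descent(X, t))
```

With a sufficiently small \eta the two procedures end up with essentially the same weights, which is the approximation claim made above.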
Multilayer Networks of Sigmoid Units

(Figure: a network with inputs x_1 and x_2, hidden units h_1 and h_2, and output o, shown next to the regions d_0, d_1, d_2, d_3 of the input space that it separates.)

Multilayer Decision Space

(Figure: the nonlinear decision regions that such a multilayer network can form.)

Sigmoid Unit

(Figure: inputs x_0 = 1, x_1, ..., x_n with weights w_0, ..., w_n feed a single sigmoid unit.)

net = \sum_{i=0}^{n} w_i x_i \qquad o = \sigma(net) = \frac{1}{1 + e^{-net}}

\sigma(x) = \frac{1}{1 + e^{-x}} is the sigmoid function.

Nice property:

\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))

We can derive gradient descent rules to train:
• One sigmoid unit
• Multilayer networks of sigmoid units -> Backpropagation

The Sigmoid Function

(Plot: \sigma(x) = \frac{1}{1 + e^{-x}} as a function of the net input, rising smoothly from 0 toward 1.)

• Sort of a rounded step function
• Unlike the step function, it has a derivative, which makes learning possible

Error Gradient for a Sigmoid Unit

\frac{\partial E}{\partial w_i}
= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2
= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)
= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}

But we know:

\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)
\qquad
\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}

So:

\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}

Backpropagation Algorithm

Initialize all weights to small random numbers. Until satisfied, do:
• For each training example, do
  1. Input the training example to the network and compute the network outputs
  2. For each output unit k:
     \delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)
  3. For each hidden unit h:
     \delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k}\, \delta_k
  4. Update each network weight w_{i,j}:
     w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}, where \Delta w_{i,j} = \eta\, \delta_j\, x_{i,j}
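These update rules translate almost line for line into code. Below is a minimal NumPy sketch of stochastic backpropagation for a single hidden layer of sigmoid units; the function names, learning rate, epoch count, and the 8 x 3 x 8 identity-mapping demo (which anticipates the hidden-layer-representation example that follows) are illustrative choices, not a transcription of any particular original implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(examples, n_hidden, eta=0.3, epochs=5000, seed=0):
    """Stochastic-gradient backpropagation for one hidden layer of sigmoid units.

    examples: list of (x, t) pairs of NumPy arrays, with targets in [0, 1].
    A constant 1 is appended to the inputs and to the hidden activations to play
    the role of the x_0 = 1 bias input. Returns (W_hidden, W_out).
    """
    rng = np.random.default_rng(seed)
    n_in, n_out = len(examples[0][0]), len(examples[0][1])
    W_hidden = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))   # small random weights
    W_out = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))
    for _ in range(epochs):
        for x, t in examples:
            # Forward pass.
            x1 = np.append(x, 1.0)
            h = sigmoid(W_hidden @ x1)
            h1 = np.append(h, 1.0)
            o = sigmoid(W_out @ h1)
            # Error terms: delta_k = o_k(1 - o_k)(t_k - o_k) for output units,
            # delta_h = o_h(1 - o_h) * sum_k w_hk * delta_k for hidden units.
            delta_out = o * (1.0 - o) * (t - o)
            delta_hidden = h * (1.0 - h) * (W_out[:, :-1].T @ delta_out)
            # Weight updates: w_ij <- w_ij + eta * delta_j * x_ij.
            W_out += eta * np.outer(delta_out, h1)
            W_hidden += eta * np.outer(delta_hidden, x1)
    return W_hidden, W_out

# Demo in the spirit of the 8 x 3 x 8 identity-mapping network discussed next.
patterns = [np.eye(8)[i] for i in range(8)]
W_hidden, W_out = train_backprop([(p, p) for p in patterns], n_hidden=3)
for p in patterns:
    h = sigmoid(W_hidden @ np.append(p, 1.0))      # the learned 3-unit hidden encoding
    print(np.round(h, 2), np.round(sigmoid(W_out @ np.append(h, 1.0)), 2))
```

How cleanly the three hidden units end up encoding the eight inputs, and how many epochs that takes, depends on the learning rate and on the random initial weights.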
More on Backpropagation

• Gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  – In practice, often works well (can run multiple times)
• Often includes a weight momentum term \alpha:
  \Delta w_{i,j}(n) = \eta\, \delta_j\, x_{i,j} + \alpha\, \Delta w_{i,j}(n-1)
• Minimizes error over the training examples
  – Will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
  – Using the network after training is fast

Learning Hidden Layer Representations

An 8 x 3 x 8 network is trained to learn the identity function: each one-hot input pattern should be reproduced at the output.

Input      Output
10000000   10000000
01000000   01000000
00100000   00100000
00010000   00010000
00001000   00001000
00000100   00000100
00000010   00000010
00000001   00000001

After training, the three hidden units encode each input as follows:

Input      Hidden values   Output
10000000   .89 .04 .08     10000000
01000000   .01 .11 .88     01000000
00100000   .01 .97 .27     00100000
00010000   .99 .97 .71     00010000
00001000   .03 .05 .02     00001000
00000100   .22 .99 .99     00000100
00000010   .80 .01 .98     00000010
00000001   .60 .94 .01     00000001

Output Unit Error during Training

(Plot: sum of squared errors for each output unit versus training iterations, roughly 0 to 2500.)

Hidden Unit Encoding

(Plot: the hidden unit encoding for one input versus training iterations, roughly 0 to 2500.)

Input to Hidden Weights

(Plot: the weights from the inputs to one hidden unit versus training iterations, roughly 0 to 2500.)

Convergence of Backpropagation

Gradient descent to some local minimum:
• Perhaps not the global minimum
• Momentum can cause quicker convergence
• Stochastic gradient descent also results in faster convergence
• Can train multiple networks and get different results (using different initial weights)

Nature of convergence:
• Initialize weights near zero
• Therefore, initial networks are near-linear
• Increasingly non-linear functions emerge as training progresses

Expressive Capabilities of ANNs

Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• But this might require a number of hidden units that is exponential in the number of inputs

Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

Overfitting in ANNs

(Plots: two examples of error versus number of weight updates, each comparing training-set error with validation-set error.)

Neural Nets for Face Recognition

(Figure: a network with 30x32 image inputs and four outputs -- left, straight, right, up -- together with typical input images.)

• 90% accurate at learning head pose, and recognizing 1 of 20 faces

Learned Network Weights

(Figure: the learned weights from the 30x32 inputs into the hidden units, displayed as images, alongside typical input images.)

Alternative Error Functions

Penalize large weights:

E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2

Train on target slopes as well as values:

E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]

Tie together weights:
• e.g., in phoneme recognition networks
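The weight-penalty version changes the training rule only slightly: the extra term \gamma \sum_{i,j} w_{ji}^2 contributes 2\gamma w_{ji} to the gradient, so every gradient step also shrinks each weight a little, which is commonly called weight decay. A minimal sketch of the modified update for one layer, in the notation of the backpropagation sketch earlier; the function name and the value of \gamma are illustrative assumptions.

```python
import numpy as np

def decayed_weight_update(W, delta, activations, eta=0.3, gamma=1e-4):
    """One layer's update under E(w) = squared error + gamma * sum of squared weights.

    delta: the layer's error terms; activations: the inputs feeding that layer.
    Compared with plain backpropagation, the penalty adds -eta * 2 * gamma * W,
    shrinking every weight slightly on each update (weight decay).
    gamma = 1e-4 is only an illustrative value.
    """
    return W + eta * (np.outer(delta, activations) - 2.0 * gamma * W)
```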
Recurrent Networks

(Figures: a feedforward network computing y(t+1) from x(t); a recurrent network with a feedback connection; and the same network unfolded in time, where the copy for x(t) also receives y(t), the copy for x(t-1) receives y(t-1), and so on back to x(t-2).)
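To make the unfolding concrete, here is a minimal sketch of the forward pass of such a recurrent network, reading the figure as feeding the previous output back in as context; the layer sizes, weight ranges, and names are illustrative assumptions. Unrolling the loop over the sequence reproduces the unfolded-in-time picture, with one copy of the same weights per time step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recurrent_forward(xs, W_in, W_back, W_out):
    """Forward pass of a simple recurrent sigmoid network over an input sequence.

    At step t the hidden layer sees the current input x(t) together with the
    previous output y(t) fed back through W_back; the loop body is the single
    network copy that gets repeated when the network is unfolded in time.
    """
    y = np.zeros(W_out.shape[0])            # the fed-back context, initially zero
    ys = []
    for x in xs:
        h = sigmoid(W_in @ x + W_back @ y)  # hidden layer: x(t) plus fed-back y(t)
        y = sigmoid(W_out @ h)              # y(t+1)
        ys.append(y)
    return ys

# Illustrative shapes: 3 inputs, 5 hidden units, 2 outputs, a sequence of 4 steps.
rng = np.random.default_rng(0)
W_in = rng.uniform(-0.1, 0.1, (5, 3))
W_back = rng.uniform(-0.1, 0.1, (5, 2))
W_out = rng.uniform(-0.1, 0.1, (2, 5))
print(recurrent_forward([rng.normal(size=3) for _ in range(4)], W_in, W_back, W_out))
```

Training such a network amounts to backpropagating error through the unfolded copies while keeping the copies' weights tied together.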