Artificial Neural Networks
• Approximate real-valued, discrete-valued, and vector-valued functions
• Handle continuous or discrete-valued attributes
• Applications: handwritten characters, spoken words, faces, …
• Handle noisy data
• Long training times
• Quick evaluation/test time
• Human readability of the learned network is not important

Perceptron
$y = \begin{cases} 1 & \text{if } \sum_{j=0}^{d} w_j x_j > 0 \\ 0 & \text{otherwise} \end{cases}$, with $x_0 \equiv 1$
[Figure: a perceptron unit — inputs $x_0 = 1, x_1, \dots, x_d$ weighted by $w_0, \dots, w_d$ feed a summation $\Sigma$ followed by a 0/1 step threshold.]

Gradient Descent
• Delta rule
• Unthresholded perceptron:
  $\hat{y} = \sum_{j=0}^{d} w_j x_j = \mathbf{w} \cdot \mathbf{x}$
  $E(\mathbf{w}) = \sum_{(\mathbf{x}^{(i)}, y^{(i)}) \in Tr} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$
[Figure: the same unit without the threshold — the output is the weighted sum itself.]

Representational Power
• Perceptron representational power
$y = \begin{cases} 1 & \text{if } \sum_{j=0}^{d} w_j x_j > 0 \\ 0 & \text{otherwise} \end{cases}$, with $x_0 \equiv 1$
[Figure: a linearly separable set of + and − points split by the hyperplane $\sum_{j=0}^{d} w_j x_j = 0$, and a second configuration of + and − points that no single hyperplane can separate.]

Gradient Descent
[Figure only.]

Gradient Descent Rule
$w_j \leftarrow w_j + \eta \sum_i \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}$
$w_j \leftarrow w_j + \eta \sum_i \left( y^{(i)} - \hat{P}\!\left( y^{(i)} = 1 \mid \mathbf{x}^{(i)}, \mathbf{w} \right) \right) x_j^{(i)}$

Gradient Descent Algorithm
Initialize each $w_j$ to a small random value
Until the termination condition is met, Do
  For j = 0..d: $\Delta w_j \leftarrow 0$
  For i = 1..N
    $\hat{y}^{(i)} = \sum_{j=0}^{d} w_j x_j^{(i)}$
    For j = 0..d: $\Delta w_j \leftarrow \Delta w_j + \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}$
  For j = 0..d: $w_j \leftarrow w_j + \Delta w_j$

Stochastic Gradient Descent
• Update weights after each instance rather than summing over all of the training data
• Standard gradient descent uses the true gradient rather than instance-specific gradients
  – Therefore, it can use a larger step size
• If there are multiple local minima in E(w), stochastic gradient descent might avoid falling into one
  – What else can do this?

Stochastic Gradient Descent Algorithm
Initialize each $w_j$ to a small random value
Until the termination condition is met, Do
  For i = 1..N
    $\hat{y}^{(i)} = \sum_{j=0}^{d} w_j x_j^{(i)}$
    For j = 0..d: $w_j \leftarrow w_j + \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}$

Multilayer Networks
• Recall the perceptron's representational power: a single threshold unit can only separate classes with the hyperplane $\sum_{j=0}^{d} w_j x_j = 0$, so configurations like the non-separable example above motivate networks with hidden layers.

Multilayer Networks
[Figure: a multilayer feedforward network. From Machine Learning (Mitchell, 1997).]

A Net of Linear Units is Still Linear
[Figure: a two-layer network in which every unit computes the linear sum $\hat{y} = \sum_{j=0}^{d} w_j x_j = \mathbf{w} \cdot \mathbf{x}$; composing linear units yields another linear function, so nothing is gained without a nonlinearity.]

Perceptron – Not Differentiable
$y = \begin{cases} 1 & \text{if } \sum_{j=0}^{d} w_j x_j > 0 \\ 0 & \text{otherwise} \end{cases}$, with $x_0 \equiv 1$
[Figure: the 0/1 step threshold plotted over the weighted sum — the step is not differentiable, so the gradient descent rule cannot be applied directly.]

Logistic: Nonlinear & Differentiable
$g(\mathbf{x}) = \sum_{j=0}^{d} w_j x_j$, with $x_0 \equiv 1$
$\hat{y} = \sigma(g(\mathbf{x})) = \dfrac{1}{1 + e^{-g(\mathbf{x})}}$
$\dfrac{d\sigma(z)}{dz} = \sigma(z)\,(1 - \sigma(z))$

Multilayer Networks – Logistic σ(x)
[Figure: a multilayer network of sigmoid units. From Machine Learning (Mitchell, 1997).]

The Backpropagation Algorithm
• Sum the error over all N training instances as before
• Also sum over the K outputs of the network
$E(\mathbf{w}) = \sum_{i=1}^{N} \sum_{k=1}^{K} \left( y_k^{(i)} - \hat{y}_k^{(i)} \right)^2$

Backpropagation Algorithm
Create the network and initialize each $w_{jh}$ and $w_{hk}$ to a small random value
Until the termination condition is met, Do: (1000s of epochs)
  For each <x, y>
    Calculate all unit outputs
    Error backpropagation:
      For all net outputs, k = 1..K: $\delta_k \leftarrow \hat{y}_k (1 - \hat{y}_k)(y_k - \hat{y}_k)$
      For all hidden units, h = 0..H: $\delta_h \leftarrow o_h (1 - o_h) \sum_{k=1..K} w_{hk} \delta_k$
    Update all weights: $w_{hk} \leftarrow w_{hk} + \eta\, \delta_k o_h$ and $w_{jh} \leftarrow w_{jh} + \eta\, \delta_h x_j$
[Figure: the network — inputs $x_0 = 1, x_1, \dots, x_d$, hidden units reached through weights $w_{jh}$, and outputs $y_1, \dots, y_K$ reached through weights $w_{hk}$.]
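The algorithm above maps directly onto a few lines of NumPy. The sketch below is a minimal illustration, not the course's reference implementation: it trains a single-hidden-layer network of sigmoid units with the $\delta_k$, $\delta_h$, and weight-update rules from the slide; the array names (W_in, W_out), the XOR toy data, and the hyperparameters (η = 0.5, 3 hidden units, 10,000 epochs) are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, Y, n_hidden=3, eta=0.5, epochs=10000, seed=0):
    """Stochastic backpropagation for one hidden layer of sigmoid units (sketch)."""
    rng = np.random.default_rng(seed)
    d, K = X.shape[1], Y.shape[1]
    # Small random initial weights; index 0 of each input/hidden vector is the bias unit.
    W_in = rng.uniform(-0.05, 0.05, size=(n_hidden, d + 1))    # w_jh
    W_out = rng.uniform(-0.05, 0.05, size=(K, n_hidden + 1))   # w_hk
    for _ in range(epochs):
        for x, y in zip(X, Y):
            xb = np.concatenate(([1.0], x))                     # x_0 = 1
            o_h = np.concatenate(([1.0], sigmoid(W_in @ xb)))   # hidden outputs, o_0 = 1
            y_hat = sigmoid(W_out @ o_h)                        # network outputs
            # Error backpropagation: delta terms from the slide.
            delta_k = y_hat * (1 - y_hat) * (y - y_hat)
            delta_h = o_h * (1 - o_h) * (W_out.T @ delta_k)
            # Weight updates: w <- w + eta * delta * (input feeding that weight).
            W_out += eta * np.outer(delta_k, o_h)
            W_in += eta * np.outer(delta_h[1:], xb)             # bias unit has no incoming weights
    return W_in, W_out

if __name__ == "__main__":
    # Toy usage: learn XOR, which a single perceptron cannot represent.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    Y = np.array([[0], [1], [1], [0]], dtype=float)
    W_in, W_out = backprop_train(X, Y)
    for x in X:
        o_h = np.concatenate(([1.0], sigmoid(W_in @ np.concatenate(([1.0], x)))))
        print(x, "->", sigmoid(W_out @ o_h))
```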
Stopping Criteria
• Until the termination condition is met, Do …
• A specified number of iterations
• Error on the training data < ε
• Error on the validation data < ε
• $\arg\min_i$ error on the validation data
• $\arg\min_i \hat{f}(\text{error on the validation data})$

Derivation of the Backpropagation Rule (output units)
[Figure: the network again — inputs $x_0 = 1, x_1, \dots, x_d$, hidden units via weights $w_{jh}$, outputs $y_1, \dots, y_K$ via weights $w_{hk}$.]
Derivation for the output units. Let
$in_k \equiv \sum_h w_{hk} o_h$ = weighted sum of inputs to output unit k. Then
$\dfrac{\partial E}{\partial w_{hk}} = \dfrac{\partial E}{\partial in_k} \dfrac{\partial in_k}{\partial w_{hk}} = \dfrac{\partial E}{\partial in_k}\, o_h = \dfrac{\partial E}{\partial \hat{y}_k} \dfrac{\partial \hat{y}_k}{\partial in_k}\, o_h = \dfrac{\partial}{\partial \hat{y}_k}\!\left[ \sum_{l=1..K} (y_l - \hat{y}_l)^2 \right] \dfrac{\partial \sigma(in_k)}{\partial in_k}\, o_h$
$= \dfrac{\partial (y_k - \hat{y}_k)^2}{\partial \hat{y}_k}\, \sigma(in_k)(1 - \sigma(in_k))\, o_h = \dfrac{\partial (y_k - \hat{y}_k)}{\partial \hat{y}_k}\, 2 (y_k - \hat{y}_k)\, \hat{y}_k (1 - \hat{y}_k)\, o_h$
and since $\partial (y_k - \hat{y}_k) / \partial \hat{y}_k = -1$,
$\dfrac{\partial E}{\partial w_{hk}} = -2 (y_k - \hat{y}_k)\, \hat{y}_k (1 - \hat{y}_k)\, o_h$

Equivalently, differentiating with respect to $w_{hk}$ directly:
$\dfrac{\partial E}{\partial w_{hk}} = \dfrac{\partial}{\partial w_{hk}} \sum_{l=1..K} (y_l - \hat{y}_l)^2 = \dfrac{\partial}{\partial w_{hk}} (y_k - \hat{y}_k)^2 = 2 (y_k - \hat{y}_k) \dfrac{\partial (y_k - \hat{y}_k)}{\partial w_{hk}} = -2 (y_k - \hat{y}_k) \dfrac{\partial \hat{y}_k}{\partial w_{hk}} = -2 (y_k - \hat{y}_k) \dfrac{\partial \sigma(in_k)}{\partial w_{hk}}$
$= -2 (y_k - \hat{y}_k)\, \sigma(in_k)(1 - \sigma(in_k)) \dfrac{\partial in_k}{\partial w_{hk}} = -2 (y_k - \hat{y}_k)\, \hat{y}_k (1 - \hat{y}_k)\, o_h$

Result:
$\dfrac{\partial E}{\partial w_{hk}} = -2 (y_k - \hat{y}_k)\, \hat{y}_k (1 - \hat{y}_k)\, o_h$
Folding the constant 2 into the learning rate, the gradient descent step is exactly the output-unit update used above: $w_{hk} \leftarrow w_{hk} + \eta\, \delta_k o_h$ with $\delta_k = \hat{y}_k (1 - \hat{y}_k)(y_k - \hat{y}_k)$.

Arbitrary Acyclic ANNs
• ANNs are not limited to two layers of σ units
  – Can be of any depth
• ANNs are not limited to fully connected layered networks
  – Can be any directed acyclic graph (DAG)

Convergence and Local Minima
• Learning is an iterative process following gradient descent
• The error surface can have several local minima
• Gradient descent might not make it to the global minimum
• Falling into a local minimum does not necessarily mean the search is trapped: a minimum with respect to one weight need not be a minimum with respect to the others

Avoiding Local Minima
• Add a momentum term: $\Delta w_{hk}(n) = \eta\, \delta_k o_h + \alpha\, \Delta w_{hk}(n-1)$, with $0 \le \alpha < 1$
  – Keeps moving somewhat in the previous direction
  – If the direction is unchanged, takes increasingly larger steps
• Use stochastic gradient descent
  – Different error surface for each training instance
• Train multiple networks with different initial w
  – Select based on validation data or use an ensemble

Representational Power
• Boolean functions can be learned by an appropriately sized single-hidden-layer network
• Bounded continuous functions can be learned by the right-sized single-hidden-layer network
• Arbitrary functions can be learned by a two-hidden-layer network
• Caveat emptor: an optimal weight vector cannot necessarily be reached from a given initial w

Hidden Layer Representations
The 8-3-8 identity task: each one-hot input must be reproduced at the output, so the three hidden units are forced to learn a compact re-encoding of the eight patterns. Learned hidden values, from Machine Learning (Mitchell, 1997):

  Input      Hidden values            Output
  10000000   .89  .04  .08  (≈ 100)   10000000
  01000000   .15  .99  .99  (≈ 011)   01000000
  00100000   .01  .97  .27  (≈ 010)   00100000
  00010000   .99  .97  .71  (≈ 111)   00010000
  00001000   .03  .05  .02  (≈ 000)   00001000
  00000100   .01  .11  .88  (≈ 001)   00000100
  00000010   .80  .01  .98  (≈ 101)   00000010
  00000001   .60  .94  .01  (≈ 110)   00000001
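The 8-3-8 experiment can be reproduced with the same kind of code. The following is a hedged sketch rather than the setup Mitchell used: it trains a 3-hidden-unit sigmoid network on the eight one-hot patterns using the momentum-augmented update Δw(n) = η δ o + α Δw(n−1) from the "Avoiding Local Minima" slide (applied to both weight layers), then prints each input's learned hidden encoding; η = 0.3, α = 0.9, and the epoch count are assumed values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The 8-3-8 identity task: eight one-hot inputs must be reproduced at the output.
X = np.eye(8)
Y = np.eye(8)

rng = np.random.default_rng(1)
H, d, K = 3, 8, 8
W_in = rng.uniform(-0.05, 0.05, (H, d + 1))    # input -> hidden weights (incl. bias)
W_out = rng.uniform(-0.05, 0.05, (K, H + 1))   # hidden -> output weights (incl. bias)
dW_in = np.zeros_like(W_in)                    # previous updates, kept for momentum
dW_out = np.zeros_like(W_out)
eta, alpha = 0.3, 0.9                          # learning rate and momentum (assumed)

for epoch in range(5000):
    for x, y in zip(X, Y):
        xb = np.concatenate(([1.0], x))
        o_h = np.concatenate(([1.0], sigmoid(W_in @ xb)))
        y_hat = sigmoid(W_out @ o_h)
        delta_k = y_hat * (1 - y_hat) * (y - y_hat)
        delta_h = o_h * (1 - o_h) * (W_out.T @ delta_k)
        # Momentum-augmented updates: Delta_w(n) = eta*delta*o + alpha*Delta_w(n-1)
        dW_out = eta * np.outer(delta_k, o_h) + alpha * dW_out
        dW_in = eta * np.outer(delta_h[1:], xb) + alpha * dW_in
        W_out += dW_out
        W_in += dW_in

# Print the learned 3-unit hidden encoding for each one-hot input.
for x in X:
    h = sigmoid(W_in @ np.concatenate(([1.0], x)))
    print(x.astype(int), "->", np.round(h, 2), "~", (h > 0.5).astype(int))
```

Because the weights start at random values, the particular 3-bit code learned will generally differ from the table above, but each input should still map to a distinct hidden pattern.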
Sum of Squared Errors
[Figure: sum of squared errors for each output unit vs. training epochs (0–2500). From Machine Learning (Mitchell, 1997).]

Hidden Unit Encoding for x = 01000000
[Figure: hidden unit encoding (the three hidden unit values) for input 01000000 vs. training epochs (0–2500). From Machine Learning (Mitchell, 1997).]

Weights to One Hidden Unit
[Figure: weights from the inputs to one hidden unit vs. training epochs (0–2500), values roughly between −5 and 4. From Machine Learning (Mitchell, 1997).]

Generalization, Overfitting & Stopping
[Figure: training set error and test set error vs. training iterations (0–20,000). From Machine Learning (Mitchell, 1997).]

Avoiding Overfitting
• Over many epochs, the weights grow large to fit the idiosyncrasies of (and noise in) the training data
  – Add a weight decay term
• Use a validation data set to assess performance
  – Roll back to the best weights (a sketch of this appears at the end of these notes)

Generalization, Overfitting & Stopping
[Figure: a second example of training set error and test set error vs. training iterations (0–6,000). From Machine Learning (Mitchell, 1997).]

Advanced Topics in ANNs
• Alternative error minimization procedures
  – Line search
  – Conjugate gradient descent
  – Etc.

Recurrent Networks
[Figure: recurrent network architectures. From Machine Learning (Mitchell, 1997).]

Advanced Topics in ANNs
• Dynamically modifying network structure
  – Start with no hidden units and add hidden units until the error is at an acceptable level
  – Remove connections with near-zero weights
  – Optimal brain damage: remove connections if small changes in the weight have little effect on the error

Summary
• Approximate real-, discrete-, and vector-valued functions
• Continuous or discrete-valued attributes
• One-hidden-layer networks can represent any Boolean or bounded continuous-valued function
• Two-hidden-layer networks can represent any function
• Hidden units create new features not present in the input space
• Avoid overfitting, as with any ML algorithm
• Backpropagation is only one of many training algorithms
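Finally, to make the validation-based stopping and rollback idea from the "Avoiding Overfitting" slide concrete, here is a minimal sketch that wraps stochastic gradient descent for the unthresholded perceptron (the delta rule) in an early-stopping loop; the synthetic data, the patience threshold, and the small weight-decay factor are illustrative assumptions, not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data for the unthresholded perceptron (delta rule).
d, N = 5, 200
w_true = rng.normal(size=d + 1)
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # x_0 = 1 bias column
y = X @ w_true + 0.1 * rng.normal(size=N)

# Hold out a validation set to decide when to stop.
X_tr, y_tr = X[:150], y[:150]
X_va, y_va = X[150:], y[150:]

w = rng.uniform(-0.05, 0.05, d + 1)
eta, decay, patience = 0.01, 1e-4, 20        # assumed hyperparameters
best_w, best_err, since_best = w.copy(), np.inf, 0

for epoch in range(1000):
    for x_i, y_i in zip(X_tr, y_tr):
        y_hat = w @ x_i
        # Delta rule step with a small weight-decay term.
        w += eta * (y_i - y_hat) * x_i - decay * w
    val_err = np.sum((y_va - X_va @ w) ** 2)
    if val_err < best_err:                    # arg min_i validation error so far
        best_w, best_err, since_best = w.copy(), val_err, 0
    else:
        since_best += 1
        if since_best >= patience:            # no improvement for a while: stop
            break

w = best_w                                    # roll back to the best weights
print(f"stopped after epoch {epoch}, best validation SSE = {best_err:.4f}")
```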