Lecture Slides for INTRODUCTION TO Machine Learning
ETHEM ALPAYDIN © The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml

CHAPTER 11: Multilayer Perceptrons

Neural Networks
- Networks of processing units (neurons) with connections (synapses) between them
- Large number of neurons: about 10^10
- Large connectivity: about 10^5 connections per neuron
- Parallel processing
- Distributed computation/memory
- Robust to noise and failures

Understanding the Brain
- Levels of analysis (Marr, 1982):
  1. Computational theory
  2. Representation and algorithm
  3. Hardware implementation
- Reverse engineering: from hardware to theory
- Parallel processing: SIMD vs MIMD
- Neural net: SIMD with modifiable local memory
- Learning: update by training/experience

Perceptron (Rosenblatt, 1962)
- y = Σ_{j=1}^{d} w_j x_j + w_0 = w^T x
- w = [w_0, w_1, ..., w_d]^T, x = [1, x_1, ..., x_d]^T

What a Perceptron Does
- Regression: y = w x + w_0
- Classification: y = 1(w x + w_0 > 0)
- With a sigmoid output: y = sigmoid(w^T x) = 1 / (1 + exp(-w^T x))

K Outputs
- Regression: y_i = Σ_{j=1}^{d} w_ij x_j + w_i0 = w_i^T x, i.e. y = W x
- Classification: o_i = w_i^T x, y_i = exp(o_i) / Σ_k exp(o_k); choose C_i if y_i = max_k y_k

Training
- Online learning (instances seen one by one) vs batch learning (whole sample):
  - No need to store the whole sample
  - Problem may change in time
  - Wear and degradation in system components
- Stochastic gradient descent: update after a single pattern
- Generic update rule (LMS rule): Δw_ij^t = η (r_i^t - y_i^t) x_j^t
  Update = LearningFactor × (DesiredOutput - ActualOutput) × Input

Training a Perceptron: Regression
- Regression (linear output):
  E^t(w | x^t, r^t) = (1/2) (r^t - y^t)^2 = (1/2) (r^t - w^T x^t)^2
  Δw_j^t = η (r^t - y^t) x_j^t

Classification
- Single sigmoid output:
  y^t = sigmoid(w^T x^t)
  E^t(w | x^t, r^t) = -r^t log y^t - (1 - r^t) log(1 - y^t)
  Δw_j^t = η (r^t - y^t) x_j^t
- K > 2 softmax outputs:
  y_i^t = exp(w_i^T x^t) / Σ_k exp(w_k^T x^t)
  E^t({w_i}_i | x^t, r^t) = -Σ_i r_i^t log y_i^t
  Δw_ij^t = η (r_i^t - y_i^t) x_j^t

Sigmoid Unit
- Inputs x_0 = 1, x_1, ..., x_n with weights w_0, w_1, ..., w_n
- net = Σ_{i=0}^{n} w_i x_i, o = σ(net) = 1 / (1 + e^(-net))
- σ(x) is the sigmoid function 1 / (1 + e^(-x)); dσ(x)/dx = σ(x) (1 - σ(x))
- Derive gradient descent rules to train:
  - One sigmoid unit: ∂E/∂w_i = -Σ_d (t_d - o_d) o_d (1 - o_d) x_{i,d}
  - Multilayer networks of sigmoid units: backpropagation (see the sketch below)
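As an illustration of the update rules above, here is a minimal NumPy sketch (not part of the original slides) of training a single sigmoid unit with online gradient descent, using the update Δw_j = η (r - y) x_j. The learning rate, epoch count, and the Boolean AND example are illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_sigmoid_unit(X, r, eta=0.5, epochs=1000, seed=0):
    """Stochastic gradient descent for a single sigmoid unit.
    X: (N, d) inputs, r: (N,) targets in {0, 1}."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    X1 = np.hstack([np.ones((N, 1)), X])      # prepend x_0 = 1 for the bias w_0
    w = rng.uniform(-0.01, 0.01, size=d + 1)  # small random initial weights
    for _ in range(epochs):
        for t in rng.permutation(N):          # online: one pattern at a time
            y = sigmoid(w @ X1[t])
            w += eta * (r[t] - y) * X1[t]     # eta * (desired - actual) * input
    return w

# Example: learning Boolean AND (linearly separable, so a single unit suffices)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
r = np.array([0, 0, 0, 1], dtype=float)
w = train_sigmoid_unit(X, r)
print(np.round(sigmoid(np.hstack([np.ones((4, 1)), X]) @ w), 2))  # approx [0, 0, 0, 1]
```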
Learning Boolean AND
(figure)

XOR (Minsky and Papert, 1969)
- No w_0, w_1, w_2 satisfy:
  w_0 ≤ 0
  w_2 + w_0 > 0
  w_1 + w_0 > 0
  w_1 + w_2 + w_0 ≤ 0

Multi-Layer Networks
- Output layer, hidden layer, input layer (figure)

Multilayer Perceptrons (Rumelhart et al., 1986)
- y_i = v_i^T z = Σ_{h=1}^{H} v_ih z_h + v_i0
- z_h = sigmoid(w_h^T x) = 1 / (1 + exp(-(Σ_{j=1}^{d} w_hj x_j + w_h0)))

x_1 XOR x_2 = (x_1 AND ~x_2) OR (~x_1 AND x_2)
(figure: a two-layer network implementing XOR)

Backpropagation
- y_i = v_i^T z = Σ_{h=1}^{H} v_ih z_h + v_i0
- z_h = sigmoid(w_h^T x) = 1 / (1 + exp(-(Σ_{j=1}^{d} w_hj x_j + w_h0)))
- Chain rule: ∂E/∂w_hj = (∂E/∂y_i) (∂y_i/∂z_h) (∂z_h/∂w_hj)

Backpropagation Algorithm
- Initialize each w_ij to some small random value
- Until the termination condition is met, do:
  - For each training example <(x_1, ..., x_n), t> do:
    - Input the instance (x_1, ..., x_n) to the network and compute the network outputs o_k
    - For each output unit k: δ_k = o_k (1 - o_k) (t_k - o_k)
    - For each hidden unit h: δ_h = o_h (1 - o_h) Σ_k w_{h,k} δ_k
    - For each network weight w_{i,j} do: w_{i,j} = w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_{i,j}

Backpropagation
- Gradient descent over the entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global, error minimum; in practice it often works well (can be invoked multiple times with different initial weights)
- Often includes a weight momentum term: Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n - 1)
- Minimizes error over the training examples; will it generalize well to unseen instances (over-fitting)?
- Training can be slow: typically 1,000-10,000 iterations (use Levenberg-Marquardt instead of gradient descent)
- Using the network after training is fast
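Below is a minimal NumPy sketch (not from the slides) of the backpropagation algorithm above for a network with one hidden layer of sigmoid units and sigmoid outputs, trained on XOR. The variable names, network size, learning rate, and epoch count are illustrative; as the convergence slides note, backpropagation can occasionally get stuck in a local minimum, so a different seed or more hidden units may be needed.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp(X, T, n_hidden=2, eta=0.5, epochs=10000, seed=1):
    """Backpropagation for a single-hidden-layer MLP with sigmoid units.
    X: (N, d) inputs, T: (N, K) targets in [0, 1]."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    K = T.shape[1]
    W = rng.uniform(-0.3, 0.3, size=(n_hidden, d + 1))  # hidden weights (incl. bias)
    V = rng.uniform(-0.3, 0.3, size=(K, n_hidden + 1))  # output weights (incl. bias)
    for _ in range(epochs):
        for t in rng.permutation(N):
            x = np.append(1.0, X[t])                    # x_0 = 1
            z = np.append(1.0, sigmoid(W @ x))          # hidden activations, z_0 = 1
            o = sigmoid(V @ z)                          # network outputs
            delta_o = o * (1 - o) * (T[t] - o)          # output-unit deltas
            delta_h = z[1:] * (1 - z[1:]) * (V[:, 1:].T @ delta_o)  # hidden deltas
            V += eta * np.outer(delta_o, z)             # w <- w + eta * delta * input
            W += eta * np.outer(delta_h, x)
    return W, V

def predict(W, V, X):
    Z = sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ W.T)
    return sigmoid(np.hstack([np.ones((len(X), 1)), Z]) @ V.T)

# XOR needs a hidden layer (a single perceptron cannot represent it)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W, V = train_mlp(X, T)
print(np.round(predict(W, V, X), 2))   # approx [[0], [1], [1], [0]]
```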
Regression
- E(W, v | X) = (1/2) Σ_t (r^t - y^t)^2, where y^t = Σ_{h=1}^{H} v_h z_h^t + v_0
- Forward: z_h = sigmoid(w_h^T x)
- Backward (output weights): Δv_h = η Σ_t (r^t - y^t) z_h^t
- Hidden weights: Δw_hj = -η ∂E/∂w_hj = -η Σ_t (∂E/∂y^t) (∂y^t/∂z_h^t) (∂z_h^t/∂w_hj)
  = η Σ_t (r^t - y^t) v_h z_h^t (1 - z_h^t) x_j^t

Regression with Multiple Outputs
- E(W, V | X) = (1/2) Σ_t Σ_i (r_i^t - y_i^t)^2, where y_i^t = Σ_{h=1}^{H} v_ih z_h^t + v_i0
- Δv_ih = η Σ_t (r_i^t - y_i^t) z_h^t
- Δw_hj = η Σ_t [Σ_i (r_i^t - y_i^t) v_ih] z_h^t (1 - z_h^t) x_j^t

8-3-8 Binary Encoder-Decoder
- 8 inputs, 3 hidden units, 8 outputs
- Learned hidden-unit values (one row per input pattern):
  .89 .04 .08
  .01 .11 .88
  .01 .97 .27
  .99 .97 .71
  .03 .05 .02
  .22 .99 .99
  .80 .01 .98
  .60 .94 .01

Sum of Squared Errors for the Output Units
(figure)

Hidden Unit Encoding for Input 01000000
(figure)

Convergence of Backprop
- Gradient descent converges to some local minimum, perhaps not the global minimum
  - Add momentum
  - Use stochastic gradient descent
  - Train multiple nets with different initial weights
- Nature of convergence:
  - Initialize weights near zero
  - Therefore, initial networks are near-linear
  - Increasingly non-linear functions become possible as training progresses

Expressive Capabilities of ANNs
- Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs
- Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik 1989]
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

(Figure: the quantities computed by the hidden units, w_h^T x + w_h0, the activations z_h, and the weighted outputs v_h z_h)

Two-Class Discrimination
- One sigmoid output y^t for P(C_1 | x^t), with P(C_2 | x^t) ≡ 1 - y^t
- y^t = sigmoid(Σ_{h=1}^{H} v_h z_h^t + v_0)
- E(W, v | X) = -Σ_t [r^t log y^t + (1 - r^t) log(1 - y^t)]
- Δv_h = η Σ_t (r^t - y^t) z_h^t
- Δw_hj = η Σ_t (r^t - y^t) v_h z_h^t (1 - z_h^t) x_j^t

K > 2 Classes
- o_i^t = Σ_{h=1}^{H} v_ih z_h^t + v_i0, y_i^t = exp(o_i^t) / Σ_k exp(o_k^t) ≈ P(C_i | x^t)
- E(W, V | X) = -Σ_t Σ_i r_i^t log y_i^t
- Δv_ih = η Σ_t (r_i^t - y_i^t) z_h^t
- Δw_hj = η Σ_t [Σ_i (r_i^t - y_i^t) v_ih] z_h^t (1 - z_h^t) x_j^t

Multiple Hidden Layers
- An MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple layers may lead to simpler networks:
  z_1h = sigmoid(w_1h^T x) = sigmoid(Σ_{j=1}^{d} w_1hj x_j + w_1h0), h = 1, ..., H_1
  z_2l = sigmoid(w_2l^T z_1) = sigmoid(Σ_{h=1}^{H_1} w_2lh z_1h + w_2l0), l = 1, ..., H_2
  y = v^T z_2 = Σ_{l=1}^{H_2} v_l z_2l + v_0
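As a sanity check on the K > 2 class formulas, here is a small NumPy sketch (not from the slides) that computes one forward pass through a single-hidden-layer MLP with a softmax output and then applies the Δv and Δw updates for a single instance. The sizes, seed, and learning rate are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(o):
    e = np.exp(o - o.max())          # shift by max for numerical stability
    return e / e.sum()

# Illustrative sizes: d inputs, H hidden units, K classes
rng = np.random.default_rng(0)
d, H, K, eta = 4, 3, 3, 0.1
x = np.append(1.0, rng.normal(size=d))        # x_0 = 1 for the bias
r = np.array([0.0, 1.0, 0.0])                 # one-hot target r^t
W = rng.uniform(-0.3, 0.3, size=(H, d + 1))   # hidden-layer weights w_h
V = rng.uniform(-0.3, 0.3, size=(K, H + 1))   # output-layer weights v_i

# Forward pass: z_h = sigmoid(w_h^T x), o_i = v_i^T z, y_i = softmax(o)_i
z = np.append(1.0, sigmoid(W @ x))
y = softmax(V @ z)

# Updates from the slides: Δv_ih = η (r_i - y_i) z_h,
# Δw_hj = η [Σ_i (r_i - y_i) v_ih] z_h (1 - z_h) x_j
dV = eta * np.outer(r - y, z)
back = (r - y) @ V[:, 1:]                     # Σ_i (r_i - y_i) v_ih, bias column excluded
dW = eta * np.outer(back * z[1:] * (1 - z[1:]), x)
V += dV
W += dW
print(np.round(y, 3), "->", np.round(softmax(V @ np.append(1.0, sigmoid(W @ x))), 3))
```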
Improving Convergence
- Momentum: Δw_i^t = -η ∂E^t/∂w_i + α Δw_i^{t-1}
- Adaptive learning rate: Δη = +a if E^{t+τ} < E^t, -bη otherwise
  (increase η slightly while the error keeps decreasing, cut it multiplicatively when the error rises; see the sketch at the end of this section)

Overfitting/Overtraining
- Number of weights: H(d + 1) + (H + 1)K
(figure)

Structured MLP (Le Cun et al., 1989)
(figure)

Weight Sharing
(figure)

Hints (Abu-Mostafa, 1995)
- Invariance to translation, rotation, size
- Virtual examples
- Augmented error: E' = E + λ_h E_h
- If x' and x are the "same": E_h = [g(x | θ) - g(x' | θ)]^2
- Approximation hint:
  E_h = 0 if g(x | θ) ∈ [a_x, b_x]
  E_h = (g(x | θ) - a_x)^2 if g(x | θ) < a_x
  E_h = (g(x | θ) - b_x)^2 if g(x | θ) > b_x

Tuning the Network Size
- Destructive: weight decay:
  Δw_i = -η ∂E/∂w_i - λ w_i
  E' = E + (λ/2) Σ_i w_i^2
- Constructive: growing networks (Ash, 1989; Fahlman and Lebiere, 1989)

Bayesian Learning
- Consider the weights w_i as random variables with a prior p(w_i)
- p(w | X) = p(X | w) p(w) / p(X)
- ŵ_MAP = argmax_w log p(w | X)
- log p(w | X) = log p(X | w) + log p(w) + C
- With a Gaussian prior p(w) = Π_i p(w_i), p(w_i) = c exp(-w_i^2 / (2σ^2)), MAP estimation is equivalent to minimizing E' = E + λ ||w||^2
- Weight decay, ridge regression, regularization: cost = data-misfit + λ × complexity

Dimensionality Reduction
(figure)

Learning Time
- Applications:
  - Sequence recognition: speech recognition
  - Sequence reproduction: time-series prediction
  - Sequence association
- Network architectures:
  - Time-delay networks (Waibel et al., 1989)
  - Recurrent networks (Rumelhart et al., 1986)

Time-Delay Neural Networks
(figure)

Recurrent Networks
(figure)

Unfolding in Time
(figure)
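The momentum, adaptive learning rate, and weight-decay ideas from the Improving Convergence and Tuning the Network Size slides can be combined into a single update step. The following is an illustrative sketch (not from the slides); the constants eta, alpha, lam, a, and b, and the toy quadratic objective, are arbitrary assumptions.

```python
import numpy as np

class GDState:
    """Carries the running quantities needed by momentum and the adaptive learning rate."""
    def __init__(self, shape, eta=0.1, alpha=0.9, lam=1e-4, a=1e-3, b=0.5):
        self.eta, self.alpha, self.lam, self.a, self.b = eta, alpha, lam, a, b
        self.prev_dw = np.zeros(shape)   # Δw^{t-1}, for the momentum term
        self.prev_E = np.inf             # previous error, for adapting eta

def gd_step(w, grad, E, s):
    """One update: Δw^t = -η ∂E/∂w - η λ w + α Δw^{t-1}; adapt η from the error trend."""
    if E < s.prev_E:
        s.eta += s.a                     # error decreased: increase η additively
    else:
        s.eta *= (1.0 - s.b)             # error increased: decrease η multiplicatively
    s.prev_E = E
    dw = -s.eta * grad - s.eta * s.lam * w + s.alpha * s.prev_dw
    s.prev_dw = dw
    return w + dw

# Toy usage: minimize E(w) = ||w - w*||^2 / 2, whose gradient is (w - w*)
w_star = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
s = GDState(w.shape)
for _ in range(200):
    grad = w - w_star
    E = 0.5 * np.sum(grad ** 2)
    w = gd_step(w, grad, E, s)
print(np.round(w, 3))   # close to w*, up to the small weight-decay shrinkage
```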