FUNCTION LEARNING AND NEURAL NETS

SETTING
Learn a function with:
- Continuous-valued examples, e.g., the pixels of an image
- Continuous-valued output, e.g., the likelihood that the image is a '7'
This is known as regression.
[Regression can be turned into classification via thresholds.]

FUNCTION-LEARNING (REGRESSION) FORMULATION
- Goal function f
- Training set: (x^(i), y^(i)), i = 1,…,n, where y^(i) = f(x^(i))
- Inductive inference: find a function h that fits the points well
- Same Keep-It-Simple bias
[Figure: sample points (x, f(x))]

LEAST-SQUARES FITTING
- Hypothesize a class of functions g(x,θ) parameterized by θ
- Minimize the squared loss E(θ) = Σ_i (g(x^(i),θ) − y^(i))²
[Figure: candidate curves fit through the sample points]

LINEAR LEAST-SQUARES
- Model: g(x,θ) = x∙θ
- The value of θ that minimizes E(θ) is θ = [Σ_i x^(i) y^(i)] / [Σ_i (x^(i))²]
Derivation:
- E(θ) = Σ_i (x^(i) θ − y^(i))² = Σ_i ((x^(i))² θ² − 2 x^(i) y^(i) θ + (y^(i))²)
- E′(θ) = 0 ⇒ Σ_i (2 (x^(i))² θ − 2 x^(i) y^(i)) = 0 ⇒ θ = [Σ_i x^(i) y^(i)] / [Σ_i (x^(i))²]
[Figure: f(x) and the fitted line g(x,θ)]

LINEAR LEAST-SQUARES WITH CONSTANT OFFSET
- Model: g(x,θ0,θ1) = θ0 + θ1 x
- E(θ0,θ1) = Σ_i (θ0 + θ1 x^(i) − y^(i))²
  = Σ_i (θ0² + θ1² (x^(i))² + (y^(i))² + 2 θ0 θ1 x^(i) − 2 θ0 y^(i) − 2 θ1 x^(i) y^(i))
- At the minimum, ∂E/∂θ0 (θ0*,θ1*) = 0 and ∂E/∂θ1 (θ0*,θ1*) = 0, so:
  0 = 2 Σ_i (θ0* + θ1* x^(i) − y^(i))
  0 = 2 Σ_i x^(i) (θ0* + θ1* x^(i) − y^(i))
- Verify the solution:
  θ0* = (1/n) Σ_i (y^(i) − θ1* x^(i))
  θ1* = [n (Σ_i x^(i) y^(i)) − (Σ_i x^(i)) (Σ_i y^(i))] / [n (Σ_i (x^(i))²) − (Σ_i x^(i))²]
[Figure: f(x) and the fitted line g(x,θ0,θ1)]

MULTI-DIMENSIONAL LEAST-SQUARES
- Let x include attributes (x1,…,xN) and θ include coefficients (θ1,…,θN)
- Model: g(x,θ) = x1 θ1 + … + xN θN
- The best θ is given by θ = (AᵀA)⁻¹ Aᵀ b, where A is the matrix with the x^(i)'s as rows and b is the vector of the y^(i)'s
[Figure: f(x) and the fitted model g(x,θ)]

NONLINEAR LEAST-SQUARES
- E.g., quadratic: g(x,θ) = θ0 + x θ1 + x² θ2
- E.g., exponential: g(x,θ) = exp(θ0 + x θ1)
- Any combination: g(x,θ) = exp(θ0 + x θ1) + θ2 + x θ3
- Fitting can be done using gradient descent
[Figure: linear, quadratic, and other fits to the same points of f(x)]

GRADIENT DESCENT
- Model: g(x,θ) = x1 θ1 + … + xN θN
- Error: E(θ) = Σ_i (g(x^(i),θ) − y^(i))²
- Take the derivative: dE(θ)/dθ = 2 Σ_i [dg(x^(i),θ)/dθ] (g(x^(i),θ) − y^(i))
- Since dg(x^(i),θ)/dθ = x^(i), this gives dE(θ)/dθ = 2 Σ_i x^(i) (g(x^(i),θ) − y^(i))
- Update rule: θ ← θ − ε Σ_i x^(i) (g(x^(i),θ) − y^(i))
- Convergence to the global minimum is guaranteed (with ε chosen small enough) because E is a convex function

STOCHASTIC GRADIENT DESCENT
- The prior rule was a batch update, because all examples were incorporated in each step
- It needs to store all prior examples
- Stochastic gradient descent: use a single example on each step
- Update rule: pick an example i (either at random or in order) and a step size ε, then set
  θ ← θ + ε x^(i) (y^(i) − g(x^(i),θ))
- This reduces the error on the i-th example… but does it converge?
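To make the batch and stochastic updates concrete, here is a minimal sketch in Python with NumPy. The synthetic dataset, the step size ε = 10⁻³, and the iteration counts are illustrative assumptions, not part of the slides; the sketch also checks both iterative solutions against the closed form θ = (AᵀA)⁻¹ Aᵀ b from the multi-dimensional slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n examples with N = 2 attributes, y = x . theta_true + noise.
n, N = 200, 2
A = rng.normal(size=(n, N))                      # matrix with the x^(i)'s as rows
theta_true = np.array([2.0, -3.0])
b = A @ theta_true + 0.1 * rng.normal(size=n)    # vector of the y^(i)'s

# Closed form: theta = (A^T A)^(-1) A^T b.
theta_closed = np.linalg.solve(A.T @ A, A.T @ b)

# Batch gradient descent: theta <- theta - eps * sum_i x^(i) (g(x^(i),theta) - y^(i)).
eps = 1e-3
theta_batch = np.zeros(N)
for t in range(500):
    theta_batch -= eps * A.T @ (A @ theta_batch - b)

# Stochastic gradient descent: theta <- theta + eps * x^(i) (y^(i) - g(x^(i),theta)),
# using a single example per step, picked at random.
theta_sgd = np.zeros(N)
for t in range(20000):
    i = rng.integers(n)
    theta_sgd += eps * A[i] * (b[i] - A[i] @ theta_sgd)

print(theta_closed, theta_batch, theta_sgd)      # all should be close to theta_true
```

With a fixed step size the stochastic iterate only settles near the minimum; as a later slide notes, the step size must be reduced over time (e.g., O(1/t)) for true convergence.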
PERCEPTRON
(The goal function f is a Boolean one.)
- A perceptron computes y = g(Σ_{i=1,…,n} w_i x_i): the inputs x_i are weighted by w_i, summed, and passed through a threshold function g
- In two dimensions, the decision boundary is the line w1 x1 + w2 x2 = 0
[Figure: + and − examples in the (x1, x2) plane separated by the line w1 x1 + w2 x2 = 0, and the unit computing y = g(Σ w_i x_i)]
[Figure: a set of + and − examples that no single line can separate]

PERCEPTRON LEARNING RULE
- θ ← θ + ε x^(i) (y^(i) − g(θᵀ x^(i)))
  (g outputs either 0 or 1; y is either 0 or 1)
- If the output is correct, the weights are unchanged
- If g is 0 but y is 1, the weights on the active attributes are increased
- If g is 1 but y is 0, the weights on the active attributes are decreased
- Converges if the data are linearly separable, but oscillates otherwise

UNIT (NEURON)
- y = g(Σ_{i=1,…,n} w_i x_i)
- g(u) = 1/[1 + exp(−u)], a sigmoid instead of a hard threshold
[Figure: a unit with inputs x1,…,xn, weights w_i, a summation Σ, and an activation g producing y]

A SINGLE NEURON CAN LEARN
- A disjunction of Boolean literals, e.g., x1 ∨ x2 ∨ ¬x3
- The majority function
- XOR? (No: XOR is not linearly separable)

NEURAL NETWORK
- A network of interconnected neurons
- Acyclic (feed-forward) vs. recurrent networks
[Figure: two units wired together, the output of one feeding an input of the other]

TWO-LAYER FEED-FORWARD NEURAL NETWORK
[Figure: inputs connected to a hidden layer by weights w1j, and the hidden layer connected to the output layer by weights w2k]

NETWORKS WITH HIDDEN LAYERS
- Can learn XORs and other nonlinear functions
- As the number of hidden units increases, so does the network's capacity to learn functions with more nonlinear features
- It is difficult to characterize exactly which class of functions, though!
- How do we train the hidden layers?

BACKPROPAGATION (PRINCIPLE)
- New example: y^(k) = f(x^(k))
- φ^(k) = outcome of the NN with weights w^(k−1) on inputs x^(k)
- Error function: E^(k)(w^(k−1)) = (φ^(k) − y^(k))²
- Update: w_ij^(k) = w_ij^(k−1) − ε ∂E^(k)/∂w_ij (that is, w^(k) = w^(k−1) − ε ∇E)
- Backpropagation algorithm: update the weights of the inputs to the last layer first, then the weights of the inputs to the previous layer, and so on

UNDERSTANDING BACKPROPAGATION
- Minimize E(θ) by gradient descent: evaluate the gradient of E at the current θ, then take a step proportional to the negative gradient
[Figure sequence: the error curve E(θ), the gradient of E at the current point, and a step along the negative gradient]

LEARNING ALGORITHM
- Given: examples (x^(1),y^(1)),…,(x^(n),y^(n)) and a learning rate ε
- Init: set k = 1 (or k = rand(1,n))
- Repeat:
  - Tweak the weights with a backpropagation update on example (x^(k), y^(k))
  - Set k = k+1 (or k = rand(1,n))
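Below is a minimal sketch of this loop for the two-layer sigmoid network above, in Python with NumPy. The XOR training set, the hidden-layer width of 4, the appended constant-1 inputs (implementing an offset term, as in the linear case), the learning rate, and the step count are all illustrative assumptions; the update itself is the stochastic rule w ← w − ε ∇E^(k), applied to the last layer first and then the previous one.

```python
import numpy as np

def g(u):                               # sigmoid unit: g(u) = 1 / (1 + exp(-u))
    return 1.0 / (1.0 + np.exp(-u))

# XOR: not representable by a single unit, learnable with a hidden layer.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([0., 1., 1., 0.])

rng = np.random.default_rng(1)
H = 4                                   # hidden-layer width (an assumption)
W1 = rng.normal(size=(3, H))            # input -> hidden weights (incl. offset row)
W2 = rng.normal(size=H + 1)             # hidden -> output weights (incl. offset)
eps = 0.5                               # learning rate epsilon (an assumption)

def forward(x):
    xa = np.append(x, 1.0)              # constant 1 implements the offset term
    h = g(xa @ W1)                      # hidden activations
    ha = np.append(h, 1.0)
    return xa, h, ha, g(ha @ W2)        # last element is the output phi

for t in range(30000):
    k = rng.integers(len(X))            # one example per step (stochastic update)
    xa, h, ha, phi = forward(X[k])

    # Backward pass on E^(k) = (phi - y^(k))^2: last layer first, then previous.
    d_out = 2 * (phi - Y[k]) * phi * (1 - phi)    # dE/d(net input of output unit)
    d_hid = d_out * W2[:H] * h * (1 - h)          # propagated back to hidden nets

    W2 -= eps * d_out * ha              # update the weights into the last layer ...
    W1 -= eps * np.outer(xa, d_hid)     # ... then those into the previous layer

for x, y in zip(X, Y):
    print(x, round(float(forward(x)[3]), 2), y)   # outputs should approach y
```

Whether a given run reaches a good fit depends on the random seed and the step size: as the next slides discuss, the error surface of a network with hidden units is nonconvex, so stochastic gradient descent is only guaranteed to settle into a local minimum.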
UNDERSTANDING BACKPROPAGATION
- Backpropagation is an example of stochastic gradient descent
- Decompose E(θ) = e1(θ) + e2(θ) + … + en(θ), where ek(θ) = (g(x^(k),θ) − y^(k))²
- On each iteration, take a step to reduce ek
[Figure sequence: the error curve E(θ) with successive steps along the gradients of e1, e2, e3]

STOCHASTIC GRADIENT DESCENT
- The objective function values (measured over all examples) settle into a local minimum over time
- The step size must be reduced over time, e.g., O(1/t)

CAVEATS
- Choosing a convergent learning rate ε can be hard in practice
[Figure: an error curve E(θ)]

COMMENTS AND ISSUES
- How to choose the size and structure of the network?
  - If the network is too large, there is a risk of overfitting (data caching)
  - If the network is too small, the representation may not be rich enough
- Role of representation: e.g., learning the concept of an odd number
- Incremental learning
- Low interpretability

PERFORMANCE OF FUNCTION LEARNING
- Overfitting: too many parameters
  - Regularization: penalize large parameter values
- Efficient optimization:
  - If E(θ) is nonconvex, we can only guarantee finding a local minimum
  - Batch updates are expensive; stochastic updates converge slowly

READINGS
- R&N 18.8–9
- HW5 due on Thursday