Predictive Learning from Data
LECTURE SET 5: Nonlinear Optimization Strategies
Electrical and Computer Engineering

OUTLINE
Objectives:
- describe common optimization strategies used for nonlinear (adaptive) learning methods
- introduce terminology
- comment on implementation of model complexity control
• Nonlinear optimization in learning
• Optimization basics (Appendix A)
• Stochastic approximation (gradient descent)
• Iterative methods
• Greedy optimization
• Summary and discussion

Optimization in predictive learning
• Recall the implementation of SRM:
  - fix complexity (VC-dimension)
  - minimize empirical risk
• Two related issues:
  - parameterization (of possible models)
  - optimization method
• Many learning methods use the dictionary parameterization
  f_m(x, w, V) = \sum_{i=0}^{m} w_i g(x, v_i)
• Optimization methods vary.

Nonlinear Optimization
• The ERM approach:
  R_emp(V, W) = (1/n) \sum_{i=1}^{n} L(y_i, f(x_i, V, W))
  where the model is f(x, V, W) = \sum_{j=1}^{m} w_j g_j(x, v_j)
• Two factors contributing to nonlinear optimization:
  - nonlinear basis functions g_j(x, v_j)
  - non-convex loss function L
• Examples: convex losses include squared error and least-modulus; a non-convex loss is the 0/1 loss for classification.

Unconstrained Minimization
• ERM or SRM (with squared loss) leads to unconstrained minimization of a real-valued function of many input variables, the subject of optimization theory.
• We discuss only popular optimization strategies developed in statistics, machine learning, and neural networks.
• These optimization methods have been introduced in various fields and usually use specialized terminology.

Dictionary representation
• Dictionary model: f_m(x, w, V) = \sum_{i=0}^{m} w_i g(x, v_i). Two possibilities:
• Linear (non-adaptive) methods ~ predetermined (fixed) basis functions g_i(x); only the parameters w_i have to be estimated, via standard optimization methods (linear least squares).
  Examples: linear regression, polynomial regression, linear classifiers, quadratic classifiers.
• Nonlinear (adaptive) methods ~ basis functions g(x, v_i) depend on the training data.
  Possibilities: basis functions nonlinear in the parameters v_i, feature selection (e.g., wavelet denoising), etc.

Nonlinear Parameterization: Neural Networks
• MLP or RBF networks implement the dictionary model f_m(x, w, V) = \sum_{i=0}^{m} w_i g(x, v_i):
  \hat{y} = \sum_{j=1}^{m} w_j z_j, where z_j = g(x, v_j); W is m × 1, V is d × m.
  [Network diagram: inputs x_1, x_2, ..., x_d feed m hidden units z_1, ..., z_m, whose outputs are combined linearly.]
• Properties:
  - dimensionality reduction
  - universal approximation property
  - possibly multiple 'hidden layers', aka "Deep Learning"

Multilayer Perceptron (MLP) network
• Basis functions of the form g_i(x) = s(x \cdot v_i + b_i), where s(t) is the sigmoid (aka logistic) function
  s(t) = 1 / (1 + exp(-t))
• Commonly used in artificial neural networks.
• A combination of sigmoids is a universal approximator.

RBF Networks
• Basis functions of the form g_i(x) = g(||x - v_i||), i.e., a Radial Basis Function (RBF), for example the Gaussian
  g(t) = exp(-t^2 / (2\sigma^2))
  (other radial functions can also be used).
• RBF adaptive parameters: center and width.
• Commonly used in artificial neural networks.
• A combination of RBFs is a universal approximator.
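To make the dictionary parameterization above concrete, here is a minimal sketch (not part of the original slides) of the forward pass of a small RBF network, assuming Gaussian basis functions with user-supplied centers and a shared width; the function name rbf_predict and all variable names are illustrative only.

    import numpy as np

    def rbf_predict(X, centers, width, weights, bias):
        """Dictionary model f(x) = w0 + sum_j w_j * g(||x - v_j||) with Gaussian g."""
        # Squared distances ||x_i - v_j||^2 for every sample/center pair
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # Basis function outputs z_ij = g(||x_i - v_j||)
        Z = np.exp(-d2 / (2.0 * width ** 2))
        # Linear combination of the basis functions
        return bias + Z @ weights

    # Toy usage: 3 Gaussian units on 2-D inputs
    X = np.random.rand(5, 2)
    centers = np.array([[0.2, 0.2], [0.5, 0.8], [0.9, 0.4]])
    y_hat = rbf_predict(X, centers, width=0.3,
                        weights=np.array([1.0, -0.5, 2.0]), bias=0.1)

The adaptive parameters V of such a network are the centers and the width; for fixed V the linear weights W could be obtained by ordinary least squares, which is the kind of split exploited by the iterative strategies later in this lecture.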
Adaptive Partitioning (CART): Piecewise-constant Regression
• Model: f(x) = \sum_{j=1}^{m} w_j I(x \in R_j)
• Each basis function is a rectangular region in x-space:
  I(x \in R_j) = \prod_{l=1}^{d} I(a_{jl} \le x_l \le b_{jl})
  so each basis function depends on 2d parameters (a_j, b_j).
• Since the regions R_j are disjoint, the parameters w_j can be easily estimated (for regression) as the average response in each region:
  w_j = (1/n_j) \sum_{x_i \in R_j} y_i
• Estimating the basis functions ~ adaptive partitioning of the x-space.

Example of CART Partitioning
• CART partitioning in a 2-D space:
  - each region ~ basis function
  - piecewise-constant estimate of y (in each region)
  - number of regions m ~ model complexity
  [Figure: the (x_1, x_2) plane split by thresholds s_1, ..., s_4 into rectangular regions R_1, ..., R_5.]

OUTLINE (repeated; next topic: stochastic approximation, aka gradient descent)

Sequential estimation of model parameters
• Batch vs. on-line (iterative) learning:
  - algorithmic (statistical) approaches ~ batch
  - neural-network-inspired methods ~ on-line
  BUT the only difference is at the implementation level, so both types of methods should yield similar generalization.
• Recall the ERM inductive principle (for regression):
  R_emp(w) = (1/n) \sum_{i=1}^{n} L(x_i, y_i, w) = (1/n) \sum_{i=1}^{n} (y_i - f(x_i, w))^2
• Assume the dictionary parameterization with fixed basis functions:
  \hat{y} = f(x, w) = \sum_{j=1}^{m} w_j g_j(x)

Sequential (on-line) least squares minimization
• Training pairs (x(k), y(k)) are presented sequentially.
• The on-line update equations for minimizing the empirical risk (MSE) with respect to the parameters w are (gradient descent learning):
  w(k+1) = w(k) - \gamma_k \, \partial L(x(k), y(k), w) / \partial w
  where the gradient is computed via the chain rule:
  \partial L(x, y, w) / \partial w_j = (\partial L / \partial \hat{y}) (\partial \hat{y} / \partial w_j) = 2 (\hat{y} - y) g_j(x)
• The learning rate \gamma_k is a small positive value (decreasing with k).

On-line least-squares minimization algorithm
• Known as the delta rule (Widrow and Hoff, 1960): given initial parameter estimates w(0), update the parameters during each presentation of the k-th training sample (x(k), y(k)).
• Step 1: forward pass computation
  z_j(k) = g_j(x(k)), j = 1, ..., m
  \hat{y}(k) = \sum_{j=1}^{m} w_j(k) z_j(k)   (estimated output)
• Step 2: backward pass computation
  \delta(k) = \hat{y}(k) - y(k)   (error term, 'delta')
  w_j(k+1) = w_j(k) - \gamma_k \delta(k) z_j(k), j = 1, ..., m
  (A code sketch of this update appears after the stochastic-approximation slide below.)

Neural network interpretation of delta rule
• Forward pass: \hat{y}(k) = \sum_j w_j(k) z_j(k).
• Backward pass: \delta(k) = \hat{y}(k) - y(k), \Delta w_j(k) = -\gamma_k \delta(k) z_j(k), w_j(k+1) = w_j(k) + \Delta w_j(k).
• 'Biological learning':
  - parallel + distributed computation
  - can be extended to multiple-layer networks
  - analogous to the Hebbian rule for a synapse with input x, weight w, and output y: \Delta w ~ xy

Theoretical basis for on-line learning
• Standard inductive learning: given training data z_1, ..., z_n, find the model providing the minimum of the prediction risk
  R(\omega) = \int L(z, \omega) p(z) dz
• Stochastic approximation guarantees minimization of the risk (asymptotically):
  \omega(k+1) = \omega(k) - \gamma_k \, grad \, L(z(k), \omega(k))
  under general conditions on the learning rate:
  \lim_{k \to \infty} \gamma_k = 0,   \sum_{k=1}^{\infty} \gamma_k = \infty,   \sum_{k=1}^{\infty} \gamma_k^2 < \infty
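As a minimal illustration of the delta rule and the stochastic-approximation updates above (not part of the original slides), the sketch below performs on-line least-squares updates for a dictionary model with fixed basis functions; the function name delta_rule, the basis argument, and the 1/k learning-rate schedule are illustrative assumptions.

    import numpy as np

    def delta_rule(X, y, basis, m, n_epochs=10, gamma0=0.5):
        """On-line least squares (Widrow-Hoff delta rule) for f(x, w) = sum_j w_j g_j(x).

        basis(x) must return the length-m vector (g_1(x), ..., g_m(x)).
        """
        w = np.zeros(m)                 # initial parameter estimates w(0)
        k = 0
        for _ in range(n_epochs):       # repeated presentations (recycling) of the data
            for xi, yi in zip(X, y):
                k += 1
                gamma = gamma0 / k      # decreasing learning-rate schedule
                z = basis(xi)           # forward pass: basis function outputs z_j(k)
                y_hat = w @ z           # forward pass: estimated output
                delta = y_hat - yi      # backward pass: error term (delta)
                w -= gamma * delta * z  # parameter update
        return w

    # Toy usage: fit y = 2x + 1 with the fixed basis (1, x)
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=50)
    y = 2 * X + 1 + 0.1 * rng.standard_normal(50)
    w = delta_rule(X, y, basis=lambda x: np.array([1.0, x]), m=2)

The 1/k schedule satisfies the three learning-rate conditions listed on the stochastic-approximation slide; in practice the schedule is usually tuned to the data, as noted below.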
Practical issues for on-line learning (cont'd)
• Stopping conditions:
  (1) monitor the gradient (i.e., stop when the gradient falls below some small threshold)
  (2) early stopping, which can also be used for complexity control

OUTLINE (repeated; next topic: iterative methods)

Iterative strategy for ERM (nonlinear optimization)
• Dictionary representation with fixed m:
  f(x, w, V) = \sum_{i=0}^{m} w_i g_i(x, v_i)
• Step 1: for the current estimates of the v_i, update the w_i.
• Step 2: for the current estimates of the w_i, update the v_i.
• Iterate Steps 1 and 2.
• Note: this idea can be implemented for different problems (e.g., unsupervised learning, supervised learning); the specific update rules depend on the type of problem and the loss function.

Iterative Methods
• Expectation-Maximization (EM), developed and used in statistics.
• Clustering and Vector Quantization, used in neural networks and signal processing.
• Both were originally proposed, and are commonly used, for unsupervised learning / density estimation problems.
• But they can be extended to supervised learning settings.

Vector Quantization and Clustering
• Two complementary goals of VQ:
  1. partition the input space into disjoint regions
  2. find the positions of the units (coordinates of the prototypes)
• Note: the optimal partitioning into regions is according to the nearest-neighbor rule (~ the Voronoi regions).

GLA (Batch version)
Given data points x_i, i = 1, ..., n, a loss function L (e.g., squared loss), and initial centers c_j(0), j = 1, ..., m, iterate the following two steps:
1. Partition the data (assign sample x_i to unit j) using the nearest-neighbor rule, giving the partitioning matrix Q:
   q_{ij} = 1 if L(x_i, c_j(k)) = min_l L(x_i, c_l(k)), and q_{ij} = 0 otherwise
2. Update the unit coordinates as the centroids of the data:
   c_j(k+1) = \sum_{i=1}^{n} q_{ij} x_i / \sum_{i=1}^{n} q_{ij},   j = 1, ..., m

Statistical Interpretation of GLA
Iterate the following two steps:
1. Partition the data (assign sample x_i to unit j) using the nearest-neighbor rule (partitioning matrix Q, as above)
   ~ projection of the data onto the model space (units), aka the Maximization step.
2. Update the unit coordinates as the centroids of the data
   ~ conditional expectation (averaging, smoothing), 'conditional' upon the results of partitioning step (1).

GLA Example 1
• Modeling a doughnut distribution using 5 units.
  [Figure: (a) initialization, (b) final positions of the units.]
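A minimal sketch of the batch GLA described above (not part of the original slides), assuming squared-error loss and numpy; it alternates the nearest-neighbor partitioning step and the centroid-update step.

    import numpy as np

    def gla(X, centers0, n_iter=20):
        """Batch GLA with squared-error loss: X is (n, d), centers0 is (m, d)."""
        centers = centers0.copy()
        for _ in range(n_iter):
            # Step 1: partition the data by the nearest-neighbor rule
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)       # winning unit for each sample (matrix Q)
            # Step 2: move each unit to the centroid of its region
            for j in range(len(centers)):
                members = X[assign == j]
                if len(members) > 0:         # leave empty units unchanged
                    centers[j] = members.mean(axis=0)
        return centers, assign

    # Toy usage: 5 units fitted to 2-D data
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 2))
    centers, assign = gla(X, centers0=X[:5].copy())

Each pass through the two steps cannot increase the empirical quantization error, so the procedure converges to a local minimum; the quality of that minimum generally depends on the initialization.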
OUTLINE (repeated; next topic: greedy optimization)

Greedy Optimization Methods
• Minimization of the empirical risk (regression problems):
  R_emp(V, W) = (1/n) \sum_{i=1}^{n} L(x_i, y_i, V, W) = (1/n) \sum_{i=1}^{n} (y_i - f(x_i, V, W))^2
  where the model is f(x, V, W) = \sum_{j=1}^{m} w_j g_j(x, v_j)
• Greedy optimization strategy: the basis functions are estimated sequentially, one at a time, i.e., the training data is represented as structure (model fit) + noise (residual):
  (1) DATA = (model) FIT 1 + RESIDUAL 1
  (2) RESIDUAL 1 = FIT 2 + RESIDUAL 2
  and so on. The final model for the data will be MODEL = FIT 1 + FIT 2 + ...
• Advantages: computational speed and interpretability.

Regression Trees (CART)
• Minimization of the empirical risk (squared error) via partitioning of the input space into regions:
  f(x) = \sum_{j=1}^{m} w_j I(x \in R_j)
• Example of CART partitioning for a function of 2 inputs:
  [Figure: the (x_1, x_2) plane split by thresholds s_1, ..., s_4 into regions R_1, ..., R_5, together with the corresponding binary tree of splits (x_1, s_1), (x_2, s_2), (x_2, s_3), (x_1, s_4).]

Growing a CART tree
• Recursive partitioning for estimating the regions (via binary splitting).
• Initial model ~ region R_0 (the whole input domain), which is divided into two regions R_1 and R_2.
• A split is defined by one of the inputs (k) and a split point s.
• Optimal values of (k, s) are chosen so that splitting a region into two daughter regions minimizes the empirical risk.
• Issues:
  - efficient implementation (selection of the optimal split)
  - optimal tree size ~ model selection (complexity control)
• Advantages and limitations.

CART Example
• CART regression model for estimating Systolic Blood Pressure (SBP) as a function of Age and Obesity in a population of males in S. Africa.

CART model selection
• Model selection strategy:
  (1) grow a large tree (subject to a minimum leaf-node size)
  (2) prune the tree by selectively merging tree nodes
• The final model minimizes the penalized risk
  R_pen(\lambda, T) = R_emp(T) + \lambda |T|
  where the empirical risk ~ MSE, |T| ~ number of leaf nodes, and \lambda ~ regularization parameter.
• Note: larger \lambda yields smaller trees.
• In practice, the tree size is often user-defined (splitmin in Matlab).

Pitfalls of greedy optimization
• The greedy binary splitting strategy may yield:
  - sub-optimal solutions (regions R_i)
  - solutions very sensitive to random samples (especially with correlated input variables)
• Counterexample for CART: estimate the function y = f(a, b, c) from the table below. Which variable should be used for the first split?
  Choosing a gives 3 errors; choosing b gives 4 errors; choosing c gives 4 errors.

  a  b  c  y
  0  0  0  0
  1  0  0  0
  0  0  1  1
  1  0  1  1
  0  1  0  1
  1  1  0  1
  1  1  1  0
  1  1  1  0

Counterexample (cont'd)
[Figure: (a) the suboptimal tree produced by CART, which splits on a first; (b) the optimal binary tree, which splits on b and c only.]

Backfitting Algorithm
• Consider regression estimation of a function of two variables of the form
  y = g_1(x_1) + g_2(x_2) + noise
  from training data (x_{1i}, x_{2i}, y_i), i = 1, 2, ..., n.
  For example, t(x_1, x_2) = x_1^2 + sin(2\pi x_2), x \in [0, 1]^2.
• Backfitting method:
  (1) estimate g_1(x_1) for fixed g_2
  (2) estimate g_2(x_2) for fixed g_1
  and iterate the above two steps.
• Estimation via minimization of the empirical risk:
  R_emp(g_1(x_1), g_2(x_2)) = (1/n) \sum_{i=1}^{n} (y_i - g_1(x_{1i}) - g_2(x_{2i}))^2
  which on the first iteration becomes
  (1/n) \sum_{i=1}^{n} ((y_i - g_2(x_{2i})) - g_1(x_{1i}))^2 = (1/n) \sum_{i=1}^{n} (r_i - g_1(x_{1i}))^2

Backfitting Algorithm (cont'd)
• Estimation of g_1(x_1) via minimization of the MSE:
  R_emp(g_1(x_1)) = (1/n) \sum_{i=1}^{n} (r_i - g_1(x_{1i}))^2 \to min
• This is a univariate regression problem: estimate g_1(x_1) from the n data points (x_{1i}, r_i), where r_i = y_i - g_2(x_{2i}).
• It can be estimated by smoothing (e.g., kNN regression).
• Estimation of g_2(x_2) (second iteration) proceeds in a similar manner, via minimization of
  R_emp(g_2(x_2)) = (1/n) \sum_{i=1}^{n} (r_i - g_2(x_{2i}))^2, where r_i = y_i - g_1(x_{1i}).
• Backfitting ~ gradient descent in function space.
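A minimal sketch of the backfitting procedure just described (not part of the original slides), assuming kNN smoothing for each univariate fit; the helper names knn_smooth and backfit and the choice k = 5 are illustrative.

    import numpy as np

    def knn_smooth(x_train, r_train, x_eval, k=5):
        """Univariate kNN regression: average r over the k nearest x values."""
        est = np.empty(len(x_eval))
        for i, x0 in enumerate(x_eval):
            nearest = np.argsort(np.abs(x_train - x0))[:k]
            est[i] = r_train[nearest].mean()
        return est

    def backfit(x1, x2, y, n_iter=10, k=5):
        """Backfitting for y = g1(x1) + g2(x2) + noise; returns fitted values at the data."""
        g1 = np.zeros_like(y)
        g2 = np.zeros_like(y)
        for _ in range(n_iter):
            # (1) estimate g1 for fixed g2: smooth the partial residual r = y - g2
            g1 = knn_smooth(x1, y - g2, x1, k)
            # (2) estimate g2 for fixed g1: smooth the partial residual r = y - g1
            g2 = knn_smooth(x2, y - g1, x2, k)
        return g1, g2

    # Toy usage on an additive target like the one in the slides
    rng = np.random.default_rng(2)
    x1, x2 = rng.uniform(0, 1, 100), rng.uniform(0, 1, 100)
    y = x1 ** 2 + np.sin(2 * np.pi * x2) + 0.1 * rng.standard_normal(100)
    g1_hat, g2_hat = backfit(x1, x2, y)

Each inner call fits a univariate smoother to the partial residuals (x_i, r_i), matching the univariate subproblem on the slide above.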
Projection Pursuit regression
• Projection Pursuit regression (PPR) is an additive model
  f(x, V, W) = \sum_{j=1}^{m} g_j(w_j \cdot x, v_j) + w_0
  where the basis functions g_j(z, v_j) are univariate functions (of projections).
• The regression model is a sum of ridge basis functions that are estimated iteratively via backfitting.

Sum of two ridge functions for PPR
[Figure: surface plots over (x_1, x_2) of two ridge functions, g_1(z_1) = 0.5 z_1 + 0.1 and g_2(z_2) = 0.1 z_2^2, where z_1 and z_2 are linear projections of (x_1, x_2).]
[Figure: surface plot of the sum g_1(x_1, x_2) + g_2(x_1, x_2) over the same domain.]

Projection Pursuit regression (cont'd)
• Projection Pursuit is an additive model
  f(x, V, W) = \sum_{j=1}^{m} g_j(w_j \cdot x, v_j) + w_0
  where the basis functions g_j(z, v_j) are univariate functions (of projections).
• A backfitting algorithm is used to estimate iteratively:
  (a) the basis functions (parameters v_j), via scatterplot smoothing
  (b) the projection parameters w_j, via gradient descent.

OUTLINE (repeated; next topic: summary and discussion)

Summary
• Different interpretations of optimization. Consider the dictionary parameterization
  f_m(x, w, V) = \sum_{i=0}^{m} w_i g(x, v_i)
• VC-theory: implementation of SRM.
• Gradient descent + iterative optimization:
  - the SRM structure is specified a priori
  - selection of m is separate from ERM
• Greedy optimization strategy:
  - no a priori specification of a structure
  - model selection is a part of optimization

Summary (cont'd)
• Interpretation of greedy optimization for f_m(x, w, V) = \sum_{i=0}^{m} w_i g(x, v_i):
  a statistical strategy for iterative data fitting
  (1) Data = (model) Fit1 + Residual_1
  (2) Residual_1 = (model) Fit2 + Residual_2
  ...
  Model = Fit1 + Fit2 + ...
• This has only a superficial similarity to SRM.
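To illustrate the 'fit + residual' strategy summarized above, here is a minimal sketch (not from the original slides) of greedy stagewise fitting in one dimension, where each stage fits a single basis function to the current residual; the use of Gaussian bumps centered at training points, and the names greedy_fit and width, are illustrative assumptions.

    import numpy as np

    def greedy_fit(x, y, n_stages=5, width=0.1):
        """Greedy stagewise fitting: Data = Fit1 + Residual1, Residual1 = Fit2 + Residual2, ...

        Each stage adds one Gaussian bump g(x) = exp(-(x - c)^2 / (2 width^2)),
        choosing its center c and weight w to best fit the current residual.
        """
        residual = y.astype(float).copy()
        stages = []                              # list of (center, weight) pairs
        for _ in range(n_stages):
            best = None
            for c in x:                          # candidate centers: the training points
                g = np.exp(-(x - c) ** 2 / (2 * width ** 2))
                w = (g @ residual) / (g @ g)     # least-squares weight for this bump
                err = ((residual - w * g) ** 2).sum()
                if best is None or err < best[0]:
                    best = (err, c, w)
            _, c, w = best
            residual -= w * np.exp(-(x - c) ** 2 / (2 * width ** 2))  # peel off this stage's fit
            stages.append((c, w))
        return stages                            # Model = Fit1 + Fit2 + ...

    # Toy usage
    rng = np.random.default_rng(3)
    x = np.sort(rng.uniform(0, 1, 60))
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(60)
    model = greedy_fit(x, y)

Note that the number of stages plays the role of m, so stopping the loop early is itself a form of model selection, which is the point the summary makes about greedy strategies.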