
Predictive Learning from Data
LECTURE SET 5: Nonlinear Optimization Strategies
Electrical and Computer Engineering
OUTLINE
Objectives
- describe common optimization strategies used for nonlinear (adaptive) learning methods
- introduce terminology
- comment on implementation of model complexity control
• Nonlinear Optimization in learning
• Optimization basics (Appendix A)
• Stochastic approximation (gradient descent)
• Iterative methods
• Greedy optimization
• Summary and discussion
Optimization in predictive learning
• Recall implementation of SRM:
  - fix complexity (VC-dimension)
  - minimize empirical risk
• Two related issues:
  - parameterization (of possible models)
  - optimization method
• Many learning methods use dictionary parameterization
  $f_m(\mathbf{x}, \mathbf{w}, V) = \sum_{i=0}^{m} w_i g(\mathbf{x}, \mathbf{v}_i)$
• Optimization methods vary
Nonlinear Optimization
• The ERM approach
  $R_{emp}(V, W) = \frac{1}{n} \sum_{i=1}^{n} L\left( y_i, f(\mathbf{x}_i, V, W) \right)$
  where the model is
  $f(\mathbf{x}, V, W) = \sum_{j=1}^{m} w_j g_j(\mathbf{x}, \mathbf{v}_j)$
• Two factors contributing to nonlinear optimization:
  - nonlinear basis functions $g_j(\mathbf{x}, \mathbf{v}_j)$
  - non-convex loss function L
• Examples
  convex loss: squared error, least-modulus
  non-convex loss: 0/1 loss for classification
Unconstrained Minimization
• ERM or SRM (with squared loss) leads to unconstrained minimization
  (convex only when the model is linear in its parameters)
• Minimization of a real-valued function of many input variables
  → Optimization theory
• We discuss only popular optimization strategies developed in statistics, machine learning and neural networks
• These optimization methods have been introduced in various fields and usually use specialized terminology
• Dictionary representation
  $f_m(\mathbf{x}, \mathbf{w}, V) = \sum_{i=0}^{m} w_i g(\mathbf{x}, \mathbf{v}_i)$
  Two possibilities:
• Linear (non-adaptive) methods
  ~ predetermined (fixed) basis functions $g_i(\mathbf{x})$
  → only the parameters $w_i$ have to be estimated, via standard optimization methods (linear least squares); see the sketch after this slide
  Examples: linear regression, polynomial regression, linear classifiers, quadratic classifiers
• Nonlinear (adaptive) methods
  ~ basis functions $g(\mathbf{x}, \mathbf{v}_i)$ depend on the training data
  Possibilities: nonlinear b.f. (in parameters $\mathbf{v}_i$), feature selection (e.g., wavelet denoising), etc.
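A minimal sketch of the linear (non-adaptive) case, using an assumed polynomial dictionary and synthetic data (not from the lecture): with fixed basis functions only the weights $w_i$ need to be estimated, here by linear least squares.

```python
# Sketch only: fixed polynomial basis g_i(x) = x^i, weights fit by linear least squares.
import numpy as np

def design_matrix(x, m):
    # columns are the fixed basis functions g_0(x)=1, g_1(x)=x, ..., g_m(x)=x^m
    return np.vander(x, m + 1, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)   # assumed toy data

m = 5                                        # number of basis functions fixed a priori
G = design_matrix(x, m)
w, *_ = np.linalg.lstsq(G, y, rcond=None)    # only the parameters w_i are estimated
print("training MSE:", np.mean((y - G @ w) ** 2))
```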
Nonlinear Parameterization: Neural Networks
• MLP or RBF networks
  $\hat{y} = \sum_{j=1}^{m} w_j z_j$, where $z_j = g(\mathbf{x}, \mathbf{v}_j)$
  $f_m(\mathbf{x}, \mathbf{w}, V) = \sum_{i=0}^{m} w_i g(\mathbf{x}, \mathbf{v}_i)$
  W is an $m \times 1$ vector, V is a $d \times m$ matrix
  [Figure: single-hidden-layer network with inputs $x_1, \dots, x_d$, hidden units $z_1, \dots, z_m$ and output $\hat{y}$]
  - dimensionality reduction
  - universal approximation property
  - possibly multiple 'hidden layers' - aka "Deep Learning"
Multilayer Perceptron (MLP) network
• Basis functions of the form $g_i(\mathbf{x}) = s(\mathbf{x} \cdot \mathbf{v}_i + b_i)$,
  where s is the sigmoid aka logistic function
  $s(t) = \frac{1}{1 + \exp(-t)}$
  - commonly used in artificial neural networks
  - combination of sigmoids ~ universal approximator
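A minimal sketch of this parameterization, with assumed toy dimensions (not from the lecture): the network output is $\hat{y} = \sum_j w_j s(\mathbf{x} \cdot \mathbf{v}_j + b_j)$ with sigmoid basis functions.

```python
# Sketch only: forward pass of an MLP-style dictionary model with sigmoid basis functions.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_output(X, V, b, w):
    # X: (n, d) inputs, V: (d, m) parameters v_j, b: (m,) biases, w: (m,) output weights
    Z = sigmoid(X @ V + b)   # hidden-unit outputs z_j = s(x . v_j + b_j)
    return Z @ w             # y_hat = sum_j w_j z_j

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))     # n=5 samples, d=3 inputs (assumed)
V = rng.standard_normal((3, 4))     # m=4 hidden units
b = rng.standard_normal(4)
w = rng.standard_normal(4)
print(mlp_output(X, V, b, w))
```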
RBF Networks
• Basis functions of the form $g_i(\mathbf{x}) = g(\| \mathbf{x} - \mathbf{v}_i \|)$,
  i.e. a Radial Basis Function (RBF), for example
  Gaussian: $g(t) = \exp\left( -\frac{t^2}{2\sigma^2} \right)$
  multiquadric: $g(t) = (t^2 + b^2)^{1/2}$
  linear: $g(t) = t$
  - RBF adaptive parameters: center, width
  - commonly used in artificial neural networks
  - combination of RBFs ~ universal approximator
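A minimal sketch of the Gaussian case, with assumed toy centers and width (not from the lecture): each basis function depends on the distance of x to its center, and the output is a weighted sum of the basis-function values.

```python
# Sketch only: output of an RBF network with Gaussian basis functions
# y_hat = sum_j w_j * exp(-||x - c_j||^2 / (2 sigma^2)).
import numpy as np

def rbf_output(X, centers, sigma, w):
    # X: (n, d) inputs, centers: (m, d) RBF centers, sigma: common width, w: (m,) weights
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # squared distances
    Z = np.exp(-d2 / (2.0 * sigma ** 2))                           # basis-function values
    return Z @ w

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 2))
centers = rng.uniform(-1, 1, size=(3, 2))   # adaptive parameters: centers (and width)
print(rbf_output(X, centers, sigma=0.5, w=np.array([1.0, -0.5, 0.3])))
```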
Piecewise-constant Regression (CART)
• Adaptive Partitioning (CART):
  $f(\mathbf{x}) = \sum_{j=1}^{m} w_j I(\mathbf{x} \in R_j)$
  each b.f. is a rectangular region in x-space:
  $I(\mathbf{x} \in R_j) = \prod_{l=1}^{d} I(a_{jl} \le x_l \le b_{jl})$
• Each b.f. depends on 2d parameters $\mathbf{a}_j, \mathbf{b}_j$
• Since the regions $R_j$ are disjoint, the parameters $w_j$ can be easily estimated (for regression) as:
  $w_j = \frac{1}{n_j} \sum_{\mathbf{x}_i \in R_j} y_i$
• Estimating b.f.'s ~ adaptive partitioning of x-space
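A minimal sketch of this parameterization, assuming two hand-picked rectangular regions and toy data (not from the lecture): the indicator of each region is a product of per-coordinate indicators, and each weight $w_j$ is the average of the responses falling in region $R_j$.

```python
# Sketch only: piecewise-constant model f(x) = sum_j w_j I(x in R_j) over rectangles.
import numpy as np

def in_region(X, a, b):
    # indicator I(x in R_j) as a product of per-coordinate indicators I(a_l <= x_l <= b_l)
    return np.all((X >= a) & (X <= b), axis=1)

def fit_region_weights(X, y, regions):
    # w_j = average response of the samples falling in region R_j
    return [y[in_region(X, a, b)].mean() for a, b in regions]

def predict(X, regions, w):
    y_hat = np.zeros(len(X))
    for (a, b), wj in zip(regions, w):
        y_hat[in_region(X, a, b)] = wj
    return y_hat

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))
y = np.where(X[:, 0] < 0.5, 1.0, 2.0) + 0.1 * rng.standard_normal(100)
regions = [(np.array([0.0, 0.0]), np.array([0.5, 1.0])),     # two assumed regions
           (np.array([0.5, 0.0]), np.array([1.0, 1.0]))]
w = fit_region_weights(X, y, regions)
print(w, np.mean((y - predict(X, regions, w)) ** 2))
```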
Example of CART Partitioning
• CART partitioning in 2D space
  - each region ~ basis function
  - piecewise-constant estimate of y (in each region)
  - number of regions m ~ model complexity
  [Figure: partitioning of the $(x_1, x_2)$ plane into regions R1-R5 by split points s1-s4]
OUTLINE
• Objectives
• Nonlinear Optimization in learning
• Optimization basics (Appendix A)
• Stochastic approximation aka gradient descent
• Iterative methods
• Greedy optimization
• Summary and discussion
Sequential estimation of model parameters
• Batch vs. on-line (iterative) learning
  - Algorithmic (statistical) approaches ~ batch
  - Neural-network inspired methods ~ on-line
  BUT the only difference is at the implementation level (so both types of methods should yield similar generalization)
• Recall the ERM inductive principle (for regression):
  $R_{emp}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} L(\mathbf{x}_i, y_i, \mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(\mathbf{x}_i, \mathbf{w}) \right)^2$
• Assume dictionary parameterization with fixed basis functions
  $\hat{y} = f(\mathbf{x}, \mathbf{w}) = \sum_{j=1}^{m} w_j g_j(\mathbf{x})$
Sequential (on-line) least-squares minimization
• Training pairs $(\mathbf{x}(k), y(k))$ presented sequentially
• On-line update equations for minimizing empirical risk (MSE) wrt parameters w:
  $\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k \frac{\partial L(\mathbf{x}(k), y(k), \mathbf{w})}{\partial \mathbf{w}}$
  (gradient-descent learning)
  where the gradient is computed via the chain rule:
  $\frac{\partial L(\mathbf{x}, y, \mathbf{w})}{\partial w_j} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_j} = 2 (\hat{y} - y) g_j(\mathbf{x})$
  The learning rate $\gamma_k$ is a small positive value (decreasing with k).
On-line least-squares minimization algorithm
• Known as the delta rule (Widrow and Hoff, 1960): given initial parameter estimates w(0), update the parameters during each presentation of the k-th training sample (x(k), y(k))
• Step 1: forward pass computation
  $z_j(k) = g_j(\mathbf{x}(k)), \quad j = 1, \dots, m$
  $\hat{y}(k) = \sum_{j=1}^{m} w_j(k) z_j(k)$  - estimated output
• Step 2: backward pass computation
  $\delta(k) = \hat{y}(k) - y(k)$  - error term (delta)
  $w_j(k+1) = w_j(k) - \gamma_k \delta(k) z_j(k), \quad j = 1, \dots, m$
  (a code sketch of this algorithm follows this slide)
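A minimal code sketch of the delta rule, assuming a toy polynomial dictionary and synthetic data (not from the lecture): each presentation of a sample triggers the forward pass, the error term, and the weight update from Steps 1 and 2.

```python
# Sketch only: delta rule (Widrow-Hoff) for y_hat = sum_j w_j g_j(x) with fixed basis functions.
import numpy as np

def delta_rule(X, y, basis, m, gamma0=0.5, n_epochs=20):
    w = np.zeros(m)                      # initial parameter estimates w(0)
    k = 0
    for _ in range(n_epochs):            # repeated presentations of the training data
        for xk, yk in zip(X, y):
            k += 1
            gamma_k = gamma0 / k         # decreasing learning rate
            z = basis(xk)                # Step 1: z_j(k) = g_j(x(k))
            y_hat = w @ z                #         estimated output
            delta = y_hat - yk           # Step 2: error term (delta)
            w = w - gamma_k * delta * z  #         parameter update
    return w

basis = lambda x: np.array([1.0, x, x ** 2])   # assumed polynomial basis g_j(x) = x^j
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 100)
y = 1.0 + 2.0 * X + 0.05 * rng.standard_normal(100)   # assumed target y = 1 + 2x + noise
print(delta_rule(X, y, basis, m=3))
```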
Neural network interpretation of delta rule
• Forward pass: compute the output $\hat{y}(k)$ from the current weights $w_0(k), \dots, w_m(k)$ and unit outputs $z_1(k), \dots, z_m(k)$
  Backward pass: $\delta(k) = \hat{y}(k) - y(k)$
  $\Delta w_j(k) = -\gamma_k \delta(k) z_j(k)$
  $w_j(k+1) = w_j(k) + \Delta w_j(k)$
  [Figure: single-layer network with bias input 1 and inputs $z_1(k), \dots, z_m(k)$, showing the forward and backward passes]
• "Biological learning"
  - parallel + distributed computation
  - can be extended to multiple-layer networks
  [Figure: synapse with input x, weight w and output y]
  Hebbian Rule: $\Delta w \sim x y$
Theoretical basis for on-line learning
• Standard inductive learning: given training data $\mathbf{z}_1, \dots, \mathbf{z}_n$, find the model providing the minimum of the prediction risk
  $R(\omega) = \int L(\mathbf{z}, \omega) \, p(\mathbf{z}) \, d\mathbf{z}$
• Stochastic Approximation guarantees minimization of the risk (asymptotically):
  $\omega_{k+1} = \omega_k - \gamma_k \, \mathrm{grad} \, L(\mathbf{z}_k, \omega_k)$
  under general conditions on the learning rate:
  $\lim_{k \to \infty} \gamma_k = 0, \qquad \sum_{k=1}^{\infty} \gamma_k = \infty, \qquad \sum_{k=1}^{\infty} \gamma_k^2 < \infty$
Practical issues for on-line learning
• Given a finite training set (n samples) $\mathbf{z}_1, \dots, \mathbf{z}_n$: this set is presented sequentially to a learning algorithm many times. Each presentation of the n samples is called an epoch, and the process of repeated presentations is called recycling (of the training data).
• Learning rate schedule: initially set large, then slowly decreasing with k (iteration number). Typically 'good' learning rate schedules are data-dependent.
• Stopping conditions:
  (1) monitor the gradient (i.e., stop when the gradient falls below some small threshold)
  (2) early stopping can be used for complexity control
  (a sketch of such a training loop follows this slide)
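A minimal sketch of these practical choices, with an assumed learning-rate schedule and toy data (not from the lecture): the data are recycled over epochs, the learning rate decreases with the iteration number, and training stops when the average per-sample gradient becomes small.

```python
# Sketch only: epoch-based on-line learning with a decreasing learning rate
# and a gradient-threshold stopping condition.
import numpy as np

def online_train(X, y, basis, m, gamma0=0.5, max_epochs=200, tol=0.1):
    w = np.zeros(m)
    k = 0
    for epoch in range(max_epochs):              # one pass over the data = one epoch
        grad_norms = []
        for xk, yk in zip(X, y):                 # recycling of the training data
            k += 1
            gamma_k = gamma0 / (1.0 + k / len(X))   # assumed schedule; data-dependent in practice
            z = basis(xk)
            grad = 2.0 * (w @ z - yk) * z        # gradient of the squared loss for one sample
            w -= gamma_k * grad
            grad_norms.append(np.linalg.norm(grad))
        if np.mean(grad_norms) < tol:            # stopping condition (1): small gradient
            break
    return w, epoch

basis = lambda x: np.array([1.0, x])
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 200)
y = 0.5 - 1.5 * X + 0.05 * rng.standard_normal(200)
w, last_epoch = online_train(X, y, basis, m=2)
print(w, "stopped after epoch", last_epoch)
```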
OUTLINE
• Objectives
• Nonlinear Optimization in learning
• Optimization basics (Appendix A)
• Stochastic approximation aka gradient descent methods
• Iterative methods
• Greedy optimization
• Summary and discussion
Iterative strategy for ERM (nonlinear optimization)
• Dictionary representation with fixed m:
  $f(\mathbf{x}, \mathbf{w}, V) = \sum_{i=0}^{m} w_i g_i(\mathbf{x}, \mathbf{v}_i)$
• Step 1: for current estimates of $\hat{\mathbf{v}}_i$, update $w_i$
• Step 2: for current estimates of $\hat{w}_i$, update $\mathbf{v}_i$
• Iterate Steps (1) and (2) above
Note: this idea can be implemented for different problems, e.g., unsupervised learning, supervised learning. Specific update rules depend on the type of problem & loss function.
Iterative Methods
• Expectation-Maximization (EM), developed/used in statistics
• Clustering and Vector Quantization, in neural networks and signal processing
• Both originally proposed and commonly used for unsupervised learning / density estimation problems
• But both can be extended to supervised learning settings
Vector Quantization and Clustering
• Two complementary goals of VQ:
  1. partition the input space into disjoint regions
  2. find positions of units (coordinates of prototypes)
  Note: optimal partitioning into regions is according to the nearest-neighbor rule (~ the Voronoi regions)
GLA (Batch version)
Given data points $\mathbf{x}_i$, $i = 1, \dots, n$, a loss function L (e.g., squared loss) and initial centers $\mathbf{c}_j(0)$, $j = 1, \dots, m$
Iterate the following two steps:
1. Partition the data (assign sample $\mathbf{x}_i$ to unit j) using the nearest-neighbor rule. Partitioning matrix Q:
   $q_{ij} = \begin{cases} 1 & \text{if } L(\mathbf{x}_i, \mathbf{c}_j(k)) = \min_l L(\mathbf{x}_i, \mathbf{c}_l(k)) \\ 0 & \text{otherwise} \end{cases}$
2. Update unit coordinates as centroids of the data:
   $\mathbf{c}_j(k+1) = \frac{\sum_{i=1}^{n} q_{ij} \mathbf{x}_i}{\sum_{i=1}^{n} q_{ij}}, \quad j = 1, \dots, m$
(a code sketch follows this slide)
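A minimal code sketch of the batch GLA with squared-error loss, using assumed toy clusters (not from the lecture): step 1 assigns each sample to its nearest center, step 2 moves each center to the centroid of its region.

```python
# Sketch only: batch Generalized Lloyd Algorithm (GLA) with squared-error loss.
import numpy as np

def gla(X, m, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=m, replace=False)]    # initial centers c_j(0)
    for _ in range(n_iter):
        # Step 1: partition the data by the nearest-neighbor rule
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)                            # index form of q_ij
        # Step 2: update each center as the centroid of the samples assigned to it
        for j in range(m):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.2, size=(50, 2)) for loc in ([0, 0], [2, 2], [0, 2])])
centers, assign = gla(X, m=3)
print(centers)
```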
Statistical Interpretation of GLA
Iterate the following two steps:
1. Partition the data (assign sample $\mathbf{x}_i$ to unit j) using the nearest-neighbor rule. Partitioning matrix Q:
   $q_{ij} = \begin{cases} 1 & \text{if } L(\mathbf{x}_i, \mathbf{c}_j(k)) = \min_l L(\mathbf{x}_i, \mathbf{c}_l(k)) \\ 0 & \text{otherwise} \end{cases}$
   ~ projection of the data onto the model space (units)
   (aka Maximization Step)
2. Update unit coordinates as centroids of the data:
   $\mathbf{c}_j(k+1) = \frac{\sum_{i=1}^{n} q_{ij} \mathbf{x}_i}{\sum_{i=1}^{n} q_{ij}}, \quad j = 1, \dots, m$
   ~ conditional expectation (averaging, smoothing), 'conditional' upon the results of partitioning step (1)
GLA Example 1
• Modeling a doughnut distribution using 5 units
  (a) initialization
  (b) final position (of units)
OUTLINE
• Objectives
• Nonlinear Optimization in learning
• Optimization basics (Appendix A)
• Stochastic approximation (gradient descent)
• Iterative methods
• Greedy optimization
• Summary and discussion
Greedy Optimization Methods
• Minimization of empirical risk (regression problems)
  $R_{emp}(V, W) = \frac{1}{n} \sum_{i=1}^{n} L(\mathbf{x}_i, y_i, V, W) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(\mathbf{x}_i, V, W) \right)^2$
  where the model is $f(\mathbf{x}, V, W) = \sum_{j=1}^{m} w_j g_j(\mathbf{x}, \mathbf{v}_j)$
• Greedy Optimization Strategy
  Basis functions are estimated sequentially, one at a time, i.e., the training data is represented as structure (model fit) + noise (residual):
  (1) DATA = (model) FIT 1 + RESIDUAL 1
  (2) RESIDUAL 1 = FIT 2 + RESIDUAL 2
  and so on. The final model for the data will be
  MODEL = FIT 1 + FIT 2 + ....
• Advantages: computational speed, interpretability
Regression Trees (CART)
• Minimization of empirical risk (squared error) via partitioning of the input space into regions
  $f(\mathbf{x}) = \sum_{j=1}^{m} w_j I(\mathbf{x} \in R_j)$
• Example of CART partitioning for a function of 2 inputs
  [Figure: left, the $(x_1, x_2)$ plane partitioned into regions R1-R5 by split points s1-s4; right, the corresponding binary tree with split nodes $(x_1, s_1)$, $(x_2, s_2)$, $(x_2, s_3)$, $(x_1, s_4)$ and leaf regions R1-R5]
Growing CART tree
• Recursive partitioning for estimating regions (via binary splitting)
• Initial model ~ region R0 (the whole input domain) is divided into two regions R1 and R2
• A split is defined by one of the inputs (k) and a split point s
• Optimal values of (k, s) are chosen so that splitting a region into two daughter regions minimizes empirical risk (see the sketch after this slide)
• Issues:
  - efficient implementation (selection of the optimal split)
  - optimal tree size ~ model selection (complexity control)
• Advantages and limitations
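A minimal sketch of the split selection step, with assumed toy data (not from the lecture): for each candidate input k and split point s, the empirical (squared) risk of the two daughter regions is evaluated, and the best (k, s) pair is kept.

```python
# Sketch only: exhaustive search for the optimal binary split (k, s) of a region.
import numpy as np

def best_split(X, y):
    # return (k, s, risk) minimizing the total squared error of the two daughter regions
    best = (None, None, np.inf)
    for k in range(X.shape[1]):                   # candidate input variable
        for s in np.unique(X[:, k]):              # candidate split point
            left, right = y[X[:, k] <= s], y[X[:, k] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (k, s, err)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = np.where(X[:, 1] < 0.3, 0.0, 1.0) + 0.1 * rng.standard_normal(200)
k, s, err = best_split(X, y)
print("split on input", k, "at", round(float(s), 3), "with risk", round(float(err), 3))
```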
CART Example
• CART regression model for estimating Systolic Blood Pressure (SBP) as a function of Age and Obesity in a population of males in S. Africa
CART model selection
• Model selection strategy:
  (1) grow a large tree (subject to a minimum leaf node size)
  (2) prune the tree by selectively merging tree nodes
• The final model minimizes the penalized risk
  $R_{pen}(\omega, \lambda) = R_{emp}(\omega) + \lambda T$
  where empirical risk ~ MSE, number of leaf nodes ~ T, regularization parameter ~ $\lambda$
• Note: larger $\lambda$ → smaller trees
• In practice, tree complexity is often user-defined (e.g., splitmin in Matlab)
Pitfalls of greedy optimization
• Greedy binary splitting strategy may yield:
  - sub-optimal solutions (regions $R_i$)
  - solutions very sensitive to random samples (especially with correlated input variables)
• Counterexample for CART: estimate the function f(a, b, c) from the table below
  - which variable for the first split?
  Choose a → 3 errors
  Choose b → 4 errors
  Choose c → 4 errors
  (a verification sketch follows this slide)

  y  a  b  c
  0  0  0  0
  0  1  0  0
  1  0  0  1
  1  1  0  1
  1  0  1  0
  1  1  1  0
  0  1  1  1
  0  1  1  1
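A small sketch verifying the error counts above from the table, splitting on one binary variable and predicting the majority value of y in each branch:

```python
# Verifying the first-split error counts for the counterexample table.
import numpy as np

# the 8 training samples (y, a, b, c) from the table above
data = np.array([
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
])
y = data[:, 0]

def split_errors(x, y):
    # misclassifications when each branch predicts its majority class
    errors = 0
    for v in (0, 1):
        branch = y[x == v]
        if len(branch) > 0:
            majority = 1 if 2 * branch.sum() > len(branch) else 0
            errors += int((branch != majority).sum())
    return errors

for name, col in zip("abc", (1, 2, 3)):
    print(f"first split on {name}: {split_errors(data[:, col], y)} errors")
# prints: a -> 3 errors, b -> 4 errors, c -> 4 errors
```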
Counterexample (cont'd)
[Figure: (a) the suboptimal tree produced by CART, whose first split is on variable a; (b) the optimal binary tree, which does not split on a]
Backfitting Algorithm
• Consider regression estimation of a function of two variables of the form
  $y = g_1(x_1) + g_2(x_2) + \text{noise}$
  from training data $(x_{1i}, x_{2i}, y_i)$, $i = 1, 2, \dots, n$
  For example, $t(x_1, x_2) = x_1^2 + \sin(2\pi x_2)$, $\mathbf{x} \in [0, 1]^2$
• Backfitting method:
  (1) estimate $g_1(x_1)$ for fixed $g_2$
  (2) estimate $g_2(x_2)$ for fixed $g_1$
  iterate the above two steps
• Estimation via minimization of empirical risk
  $R_{emp}(g_1(x_1), g_2(x_2)) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - g_1(x_{1i}) - g_2(x_{2i}) \right)^2$
  (first iteration)
  $= \frac{1}{n} \sum_{i=1}^{n} \left( (y_i - g_2(x_{2i})) - g_1(x_{1i}) \right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left( r_i - g_1(x_{1i}) \right)^2$
Backfitting Algorithm (cont'd)
• Estimation of $g_1(x_1)$ via minimization of MSE:
  $R_{emp}(g_1(x_1)) = \frac{1}{n} \sum_{i=1}^{n} \left( r_i - g_1(x_{1i}) \right)^2 \rightarrow \min$
• This is a univariate regression problem of estimating $g_1(x_1)$ from n data points $(x_{1i}, r_i)$, where $r_i = y_i - g_2(x_{2i})$
• Can be estimated by smoothing (kNN regression)
• Estimation of $g_2(x_2)$ (second iteration) proceeds in a similar manner, via minimization of
  $R_{emp}(g_2(x_2)) = \frac{1}{n} \sum_{i=1}^{n} \left( r_i - g_2(x_{2i}) \right)^2$, where $r_i = y_i - g_1(x_{1i})$
• Backfitting ~ gradient descent in the function space (a code sketch follows this slide)
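A minimal sketch of backfitting with a simple kNN smoother, using an assumed additive toy target (not from the lecture): each step fits one component to the residuals left by the other.

```python
# Sketch only: backfitting for y = g1(x1) + g2(x2) + noise with kNN smoothing.
import numpy as np

def knn_smooth(x_train, r_train, x_eval, k=10):
    # estimate g(x_eval) as the average residual of the k nearest training points
    out = np.empty(len(x_eval))
    for i, x0 in enumerate(x_eval):
        idx = np.argsort(np.abs(x_train - x0))[:k]
        out[i] = r_train[idx].mean()
    return out

rng = np.random.default_rng(0)
n = 300
x1, x2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
y = x1 ** 2 + np.sin(2 * np.pi * x2) + 0.1 * rng.standard_normal(n)   # assumed additive target

g1_hat, g2_hat = np.zeros(n), np.zeros(n)
for _ in range(10):                            # iterate the two backfitting steps
    g1_hat = knn_smooth(x1, y - g2_hat, x1)    # step (1): fit g1 to residuals r = y - g2
    g1_hat -= g1_hat.mean()                    # center one component for identifiability
    g2_hat = knn_smooth(x2, y - g1_hat, x2)    # step (2): fit g2 to residuals r = y - g1

print("training MSE:", np.mean((y - g1_hat - g2_hat) ** 2))
```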
Projection Pursuit regression
• Projection Pursuit is an additive model:
  $f(\mathbf{x}, V, W) = \sum_{j=1}^{m} g_j(\mathbf{w}_j \cdot \mathbf{x}, \mathbf{v}_j) + w_0$
  where the basis functions $g_j(z, \mathbf{v}_j)$ are univariate functions (of projections)
• The regression model is a sum of ridge basis functions that are estimated iteratively via backfitting
Sum of two ridge functions for PPR
[Figure: two surface plots over $(x_1, x_2)$: a linear ridge function $g_1(z_1) = 0.5 z_1 + 0.1$ of the projection $z_1 = x_1 + x_2$, and a quadratic ridge function $g_2(z_2) = 0.1 z_2^2$ of the projection $z_2 = x_1 - x_2$]
Sum of two ridge functions for PPR
[Figure: surface plot of the sum $g_1(x_1, x_2) + g_2(x_1, x_2)$ over $(x_1, x_2)$]
Projection Pursuit regression
• Projection Pursuit is an additive model:
  $f(\mathbf{x}, V, W) = \sum_{j=1}^{m} g_j(\mathbf{w}_j \cdot \mathbf{x}, \mathbf{v}_j) + w_0$
  where the basis functions $g_j(z, \mathbf{v}_j)$ are univariate functions (of projections)
• The backfitting algorithm is used to estimate iteratively
  (a) the basis functions (parameters $\mathbf{v}_j$) via scatterplot smoothing
  (b) the projection parameters $\mathbf{w}_j$ (via gradient descent)
  (a simplified sketch follows this slide)
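A simplified sketch of the PPR idea, not the exact procedure on the slide: here each ridge function is a kNN smoother of the current residuals, and the projection direction is chosen by random search instead of the gradient-descent update, on assumed toy data.

```python
# Simplified sketch: greedy projection pursuit regression with kNN ridge smoothers
# and random search over projection directions (instead of gradient descent).
import numpy as np

def knn_smooth(z_train, r_train, z_eval, k=15):
    out = np.empty(len(z_eval))
    for i, z0 in enumerate(z_eval):
        idx = np.argsort(np.abs(z_train - z0))[:k]
        out[i] = r_train[idx].mean()
    return out

def fit_ppr(X, y, m=2, n_candidates=30, seed=0):
    rng = np.random.default_rng(seed)
    w0 = y.mean()
    residual = y - w0
    components = []                                  # list of (direction, fitted values)
    for _ in range(m):                               # add one ridge function at a time
        best = None
        for _ in range(n_candidates):
            w = rng.standard_normal(X.shape[1])
            w /= np.linalg.norm(w)                   # unit projection direction
            fit = knn_smooth(X @ w, residual, X @ w)
            err = np.mean((residual - fit) ** 2)
            if best is None or err < best[0]:
                best = (err, w, fit)
        _, w, fit = best
        components.append((w, fit))
        residual = residual - fit                    # RESIDUAL_j = RESIDUAL_{j-1} - FIT_j
    return w0, components, residual

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = 0.5 * (X[:, 0] + X[:, 1]) + 0.1 * (X[:, 0] - X[:, 1]) ** 2 + 0.05 * rng.standard_normal(200)
w0, comps, residual = fit_ppr(X, y)
print("final residual MSE:", np.mean(residual ** 2))
```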
OUTLINE
• Objectives
• Nonlinear Optimization in learning
• Optimization basics (Appendix A)
• Stochastic approximation aka gradient descent methods
• Iterative methods
• Greedy optimization
• Summary and discussion
Summary
• Different interpretations of optimization
• Consider dictionary parameterization
  $f_m(\mathbf{x}, \mathbf{w}, V) = \sum_{i=0}^{m} w_i g(\mathbf{x}, \mathbf{v}_i)$
• VC-theory: implementation of SRM
• Gradient descent + iterative optimization
  - SRM structure is specified a priori
  - selection of m is separate from ERM
• Greedy optimization strategy
  - no a priori specification of a structure
  - model selection is a part of optimization
Summary (cont'd)
• Interpretation of greedy optimization
  $f_m(\mathbf{x}, \mathbf{w}, V) = \sum_{i=0}^{m} w_i g(\mathbf{x}, \mathbf{v}_i)$
• Statistical strategy for iterative data fitting:
  (1) Data = (model) Fit1 + Residual_1
  (2) Residual_1 = (model) Fit2 + Residual_2
  ...
  → Model = Fit1 + Fit2 + ...
• This has superficial similarity to SRM