Linear Models for Regression
Machine Learning
Sargur Srihari
[email protected]
Linear Regression with One Input
•  Simplest form of linear regression:
–  A linear function of a single input variable: y(x,w) = w0 + w1x
–  The task is to learn the weights w0, w1 from data D = {(yi, xi)}, i = 1,..,N
•  More useful class of functions:
–  Polynomial curve fitting
–  y(x,w) = w0 + w1x + w2x^2 + … = Σj wj x^j
–  A linear combination of non-linear functions φj(x) of the input variable, instead of x^j, called basis functions
•  These are linear functions of the parameters (which gives them simple analytical properties), yet are nonlinear with respect to the input variables
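A minimal sketch of such a polynomial fit by least squares (assuming NumPy; the data arrays x and t and the degree are illustrative, not from the slides):

    import numpy as np

    # Illustrative 1-D data: a noisy sine curve
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

    M = 4                                        # number of weights w0..w3 (cubic fit)
    Phi = np.vander(x, M, increasing=True)       # columns 1, x, x^2, x^3
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares weights

    y = Phi @ w                                  # fitted values y(x, w)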
Plan of Discussion
•  Discuss supervised learning, starting with regression
•  Goal: predict the value of one or more target variables t
•  Given a d-dimensional vector x of input variables
•  Terminology
–  Regression
•  When t is continuous-valued
–  Classification
•  When t takes values that are labels (non-ordered categories)
–  Ordinal Regression
•  Discrete values, ordered categories
Regression with multiple inputs
•  Polynomial
–  Single input variable: scalar x
•  Generalizations
–  Predict the value of a continuous target variable t given the value of d input variables x = [x1,..,xd]
–  t can also be a set of variables (multiple regression)
–  Linear functions of adjustable parameters
•  Specifically, linear combinations of nonlinear functions of the input variables
(Figure: target variable t plotted against the input variable x)
Simplest Linear Model with d inputs
•  Regression with d input variables
y(x,w) = w0 + w1x1 + .. + wd xd = wTx
where x = (x1,..,xd)T are the input variables
–  This differs from linear regression with one variable and from polynomial regression with one variable
•  Called Linear Regression since it is a linear function of
–  the parameters w0,..,wd
–  the input variables x1,..,xd
•  Being a linear function of the input variables is a significant limitation
–  In the one-dimensional case this amounts to a straight-line fit (a degree-one polynomial): y(x,w) = w0 + w1x
Fitting a Regression Plane
•  Assume t is a function of the inputs x1, x2,..,xd (independent variables). The goal is to find the best linear regressor of t on all the inputs.
•  For d = 2 this amounts to fitting a plane through the N input samples, or a hyperplane in d dimensions.
(Figure: plane fitted to data points with axes x1, x2, t)
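A minimal sketch (assuming NumPy; the data are illustrative) of fitting a plane t ≈ w0 + w1x1 + w2x2 by least squares:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(size=(50, 2))                  # N = 50 samples of (x1, x2)
    t = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.05 * rng.standard_normal(50)

    A = np.column_stack([np.ones(len(X)), X])      # prepend a column of 1s for w0
    w, *_ = np.linalg.lstsq(A, t, rcond=None)      # w = (w0, w1, w2)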
Learning to Rank Problem
•  LeToR
•  Multiple inputs
•  Target value
–  t is discrete (e.g., 1, 2,.., 6) in the training set, but a continuous value in [1, 6] is learnt and used to rank objects
Regression with multiple inputs: LeToR
•  Input (xi): d features of a query-URL pair (d > 200)
–  In the LETOR 4.0 dataset: 46 query-document features, maximum of 124 URLs/query
–  Example features:
•  Log frequency of query in anchor text
•  Query word in color on page
•  # of images on page
•  # of (out) links on page
•  PageRank of page
•  URL length
•  URL contains “~”
•  Page length
–  Traditional IR uses TF/IDF
•  Output (y): relevance value (the target variable)
–  Point-wise (0, 1, 2, 3)
–  Regression returns a continuous value, which allows fine-grained ranking of URLs
–  The Yahoo! data set has d = 700
Input Features
See http://research.microsoft.com/en-us/projects/mslr/feature.aspx
Feature List of Microsoft Learning to Rank Datasets (feature id: feature description, over the streams body, anchor, title, url, whole document):
–  1-5: covered query term number (body, anchor, title, url, whole document)
–  6-10: covered query term ratio (body, anchor, title, url, whole document)
–  11-15: stream length (body, anchor, title, url, whole document)
–  16-20: IDF, inverse document frequency (body, anchor, title, url, whole document)
–  21-25: sum of term frequency (body, anchor, title, url, whole document)
–  26-30: min of term frequency (body, anchor, title, url, whole document)
–  31-35: max of term frequency (body, anchor, title, url, whole document)
–  36-40: mean of term frequency (body, anchor, title, url, whole document)
–  41-45: variance of term frequency (body, anchor, title, url, whole document)
–  46-50: sum of stream length normalized term frequency (body, anchor, title, url, whole document)
where IDF(t, D) = log N / |{d ∈ D : t ∈ d}|
Feature Statistics
•  Most of the 46 features are normalized as continuous values from 0 to 1; the exceptions are a few features that are all 0s.
Feature (Min, Max, Mean):
1 (0, 1, 0.254)     2 (0, 1, 0.1598)    3 (0, 1, 0.1392)    4 (0, 1, 0.2158)
5 (0, 1, 0.1322)    6 (0, 0, 0.1614)    7 (0, 0, 0)         8 (0, 0, 0)
9 (0, 0, 0)         10 (0, 0, 0)        11 (0, 1, 0)        12 (0, 1, 0.2841)
13 (0, 1, 0.1382)   14 (0, 1, 0.2109)   15 (0, 1, 0.1218)   16 (0, 1, 0.2879)
17 (0, 1, 0.1533)   18 (0, 1, 0.2258)   19 (0, 1, 0.3057)   20 (0, 1, 0.3332)
21 (0, 1, 0.1534)   22 (0, 1, 0.5473)   23 (0, 1, 0.5592)   24 (0, 1, 0.5453)
25 (0, 1, 0.5622)   26 (0, 1, 0.1675)   27 (0, 1, 0.1377)   28 (0, 1, 0.1249)
29 (0, 1, 0.126)    30 (0, 1, 0.2109)   31 (0, 1, 0.1705)   32 (0, 1, 0.1694)
33 (0, 1, 0.1603)   34 (0, 1, 0.1275)   35 (0, 1, 0.0762)   36 (0, 1, 0.0762)
37 (0, 1, 0.0728)   38 (0, 1, 0.5479)   39 (0, 1, 0.5574)   40 (0, 1, 0.5502)
41 (0, 1, 0.5673)   42 (0, 1, 0.4321)   43 (0, 0, 0.3361)   44 (0, 1, 0)
45 (0, 1, 0.1065)   46 (0, 1, 0.1211)
Linear Regression with M Basis Functions
•  Extended by considering nonlinear functions of the input variables
y(x,w) = w0 + Σj=1..M-1 wj φj(x)
–  where the φj(x) are called basis functions
–  We now need M weights for basis functions instead of d weights for features
–  Defining a dummy basis function φ0(x) = 1 for the bias, this can be written as
y(x,w) = Σj=0..M-1 wj φj(x) = wTφ(x)
–  where w = (w0,w1,..,wM-1)T and φ = (φ0,φ1,..,φM-1)T
•  Basis functions allow non-linearity with d input variables
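A minimal sketch of evaluating y(x,w) = wTφ(x) (assuming NumPy; the particular basis functions and weights are illustrative, not from the slides):

    import numpy as np

    # Illustrative basis: phi_0(x) = 1, phi_1(x) = x, phi_2(x) = x^2  (M = 3)
    basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]

    def phi(x):
        """Stack the M basis functions into a feature vector for each input."""
        return np.stack([b(x) for b in basis], axis=-1)

    w = np.array([0.5, -1.0, 2.0])   # illustrative weights w0..w2
    x = np.linspace(0, 1, 5)
    y = phi(x) @ w                   # y(x, w) = w^T phi(x) for each x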
Polynomial Basis with one variable
•  Linear basis function model
y(x,w) = Σj=0..M-1 wj φj(x) = wTφ(x)
•  Polynomial basis (for a single variable x)
–  φj(x) = x^j, giving a degree M-1 polynomial
(Figure: polynomial basis functions x^j plotted against x)
•  Disadvantage
–  Global: changes in one region of input space affect others
–  Difficult to formulate: the number of polynomial terms increases exponentially with M
–  Can divide the input space into regions and use a different polynomial in each region: equivalent to spline functions
Polynomial Basis with d variables
(Not used in practice!)
•  Consider (for a vector x) the basis φj(x) = ||x||^j = [x1^2 + x2^2 + .. + xd^2]^(j/2)
–  x = (2,1) and x = (1,2) have the same squared sum, so it is unsatisfactory
–  The vector is being converted into a scalar value, thereby losing information
•  Better approach:
–  A polynomial of degree M-1 has terms with variables taken none, one, two,.., M-1 at a time
–  Use a multi-index j = (j1,j2,..,jd) such that j1 + j2 + .. + jd ≤ M-1
–  For a quadratic (M = 3) with three variables (d = 3):
y(x,w) = Σ(j1,j2,j3) wj φj(x)
= w0 + w1,0,0 x1 + w0,1,0 x2 + w0,0,1 x3 + w1,1,0 x1x2 + w1,0,1 x1x3 + w0,1,1 x2x3 + w2,0,0 x1^2 + w0,2,0 x2^2 + w0,0,2 x3^2
–  The number of quadratic terms is 1 + d + d(d-1)/2 + d
–  For d = 46, this is 1128
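A minimal sketch (assuming NumPy and the standard library; the function name and data are illustrative) that enumerates the multi-index terms of a degree M-1 polynomial in d variables and checks the counts quoted above:

    import numpy as np
    from itertools import combinations_with_replacement
    from math import comb

    def poly_terms(x, degree):
        """All products of the entries of x taken 0..degree at a time."""
        d = len(x)
        feats = []
        for k in range(degree + 1):                          # k variables at a time
            for idx in combinations_with_replacement(range(d), k):
                feats.append(np.prod([x[i] for i in idx]))   # empty product = 1 (bias term)
        return np.array(feats)

    x = np.array([1.0, 2.0, 3.0])          # d = 3
    print(len(poly_terms(x, 2)))           # 10 terms for a quadratic, as on the slide

    d = 46                                 # term count for d = 46, M = 3
    print(sum(comb(d + k - 1, k) for k in range(3)))   # 1 + 46 + 1081 = 1128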
Gaussian Radial Basis Functions
•  Gaussian
φj(x) = exp( -(x - µj)^2 / 2s^2 )
–  Does not necessarily have a probabilistic interpretation
–  The usual normalization term is unimportant, since the basis function is multiplied by a weight wj
•  Choice of parameters
–  The µj govern the locations of the basis functions
•  Can be an arbitrary set of points within the range of the data, e.g., some representative data points
–  s governs the spatial scale
•  Could be chosen from the data set, e.g., the average variance
•  Several variables
–  A Gaussian kernel would be chosen for each dimension
–  For each j a different mean vector would be needed, perhaps chosen from the data
φj(x) = exp( -(1/2)(x - µj)T Σ^-1 (x - µj) )
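A minimal sketch of a Gaussian RBF design matrix (assuming NumPy; the centres µj and scale s are illustrative choices, e.g. picked within the range of the data):

    import numpy as np

    def gaussian_rbf_design(x, centres, s):
        """N x M design matrix with Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2))."""
        diff = x[:, None] - centres[None, :]          # shape (N, M)
        return np.exp(-diff**2 / (2.0 * s**2))

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 1, size=30)                    # N = 30 one-dimensional inputs
    centres = np.linspace(0, 1, 9)                    # M = 9 centres mu_j in the data range
    Phi = gaussian_rbf_design(x, centres, s=0.1)      # s sets the spatial scale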
Sigmoidal Basis Function
•  Sigmoid
φj(x) = σ( (x - µj) / s ),   where σ(a) = 1 / (1 + exp(-a))
•  Equivalently tanh, because it is related to the logistic sigmoid by
tanh(a) = 2σ(2a) - 1
(Figure: the logistic sigmoid)
Other Basis Functions
•  Fourier
–  Expansion in sinusoidal functions
–  Infinite spatial extent
•  Signal processing
–  Functions localized in both time and frequency, called wavelets
–  Useful for lattices such as images and time series
•  Further discussion is independent of the choice of basis, including φ(x) = x
Probabilistic Formulation
•  Given:
–  A data set of N observations {xn}, n = 1,..,N
–  Corresponding target values {tn}
•  Goal:
–  Predict the value of t for a new value of x
•  Simplest approach:
–  Construct a function y(x) whose values are the predictions
•  Probabilistic approach:
–  Model the predictive distribution p(t|x)
•  It expresses uncertainty about the value of t for each x
–  From this conditional distribution, predict t for any value of x so as to minimize a loss function
•  The typical loss is squared loss, for which the solution is the conditional expectation of t
Relationship between Maximum Likelihood and Least Squares
•  Will show that minimizing the sum-of-squared errors is the same as the maximum likelihood solution under a Gaussian noise model
•  The target variable is a scalar t given by a deterministic function y(x,w) with additive Gaussian noise
t = y(x,w) + ε
where ε is zero-mean Gaussian with precision β
•  Thus the distribution of t is univariate normal:
p(t|x,w,β) = N( t | y(x,w), β^-1 )
with mean y(x,w) and variance β^-1
Likelihood Function
•  Data set:
–  Input X = {x1,..,xN} with targets t = {t1,..,tN}
–  The target variables tn are scalars forming a vector of size N
•  Likelihood of the target data
–  It is the probability of observing the data, assuming the observations are independent
p(t | X, w, β) = Πn=1..N N( tn | wTφ(xn), β^-1 )
Log-Likelihood Function
•  Likelihood
p(t | X, w, β) = Πn=1..N N( tn | wTφ(xn), β^-1 )
•  Log-likelihood
ln p(t | w, β) = Σn=1..N ln N( tn | wTφ(xn), β^-1 )
= (N/2) ln β - (N/2) ln 2π - β ED(w)
–  where ED(w) is the sum-of-squares error function
ED(w) = (1/2) Σn=1..N { tn - wTφ(xn) }^2
–  We have used the standard form of the univariate Gaussian
N(x | µ, σ^2) = (2πσ^2)^(-1/2) exp{ -(x - µ)^2 / (2σ^2) }
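A minimal sketch of these two quantities (assuming NumPy, and a design matrix Phi, target vector t, and weight vector w as in the earlier sketches; all names are illustrative):

    import numpy as np

    def sum_of_squares_error(Phi, t, w):
        """E_D(w) = 1/2 * sum_n (t_n - w^T phi(x_n))^2."""
        r = t - Phi @ w
        return 0.5 * np.dot(r, r)

    def log_likelihood(Phi, t, w, beta):
        """ln p(t | w, beta) = N/2 ln beta - N/2 ln 2pi - beta * E_D(w)."""
        N = len(t)
        return (0.5 * N * np.log(beta) - 0.5 * N * np.log(2 * np.pi)
                - beta * sum_of_squares_error(Phi, t, w))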
Maximizing Log-Likelihood Function
•  Log-likelihood
ln p(t | w, β) = Σn=1..N ln N( tn | wTφ(xn), β^-1 ) = (N/2) ln β - (N/2) ln 2π - β ED(w)
–  where ED(w) = (1/2) Σn=1..N { tn - wTφ(xn) }^2 is the sum-of-squares error function
•  Maximizing the likelihood with respect to w is equivalent to minimizing ED(w)
•  Take the derivative of ln p(t | w, β) with respect to w, set it equal to zero, and solve for w
–  or equivalently, the derivative of -ED(w)
Gradient of Log-likelihood wrt w
•  The gradient is
∇ ln p(t | w, β) = Σn=1..N { tn - wTφ(xn) } φ(xn)T
since ∇w [ -(1/2)(a - wb)^2 ] = (a - wb)b
•  The gradient is set to zero and we solve for w:
0 = Σn=1..N tn φ(xn)T - wT ( Σn=1..N φ(xn)φ(xn)T )
as shown on the next slide
•  The second derivative is negative, making this a maximum
Max Likelihood Solution for w
•  Solving for w:
wML = Φ+ t
–  X = {x1,..,xN} are the samples (vectors of d variables), t = {t1,..,tN} are the targets (scalars)
–  where Φ+ = (ΦTΦ)^-1 ΦT is the Moore-Penrose pseudo-inverse of the N x M design matrix
•  Design matrix: rows correspond to the N samples, columns to the M basis functions
Φ = [ φ0(x1)  φ1(x1)  ...  φM-1(x1)
      φ0(x2)    .             .
        .        .            .
      φ0(xN)    ...      φM-1(xN) ]
•  Pseudo-inverse: a generalization of the notion of matrix inverse to non-square matrices
–  If the design matrix is square and invertible, then the pseudo-inverse is the same as the inverse
•  The φj(xn) are M basis functions, e.g., Gaussians centered on M data points:
φj(x) = exp( -(1/2)(x - µj)T Σ^-1 (x - µj) )
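A minimal sketch of this pseudo-inverse solution (assuming NumPy; the data, Gaussian centres, and scale are illustrative, as in the earlier RBF sketch):

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0, 1, 25)
    t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(25)

    centres = np.linspace(0, 1, 8)
    Phi = np.exp(-(x[:, None] - centres[None, :])**2 / (2 * 0.1**2))  # N x M design matrix
    Phi = np.column_stack([np.ones(len(x)), Phi])                     # add phi_0(x) = 1 for the bias

    w_ml = np.linalg.pinv(Phi) @ t   # w_ML = Phi^+ t  (Moore-Penrose pseudo-inverse)
    # Numerically, np.linalg.lstsq(Phi, t, rcond=None) is usually preferred over forming Phi^+.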
What is the role of the Bias parameter w0?
•  Regression function:
y(x,w) = w0 + Σj=1..M-1 wj φj(x)
•  Error function:
ED(w) = (1/2) Σn=1..N { tn - wTφ(xn) }^2
•  With an explicit bias parameter:
ED(w) = (1/2) Σn=1..N { tn - w0 - Σj=1..M-1 wj φj(xn) }^2
•  Setting the derivative with respect to w0 equal to zero and solving for w0:
w0 = t̄ - Σj=1..M-1 wj φ̄j
where  t̄ = (1/N) Σn=1..N tn  and  φ̄j = (1/N) Σn=1..N φj(xn)
•  The first term is the average of the N values of t
•  The second is the weighted sum of the averaged basis function values
•  Thus the bias w0 compensates for the difference between the average target value and the weighted sum of the averages of the basis function values
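A quick numerical check of this identity (assuming NumPy and the Phi, t, w_ml from the previous sketch, where column 0 of Phi is the constant basis φ0 = 1):

    import numpy as np

    t_bar = t.mean()                        # average target value
    phi_bar = Phi[:, 1:].mean(axis=0)       # averaged basis function values (excluding the bias column)
    w0_check = t_bar - phi_bar @ w_ml[1:]   # should agree with the fitted bias w_ml[0]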
Maximum Likelihood for precision β
•  We have determined the m.l.e. solution for w using the probabilistic formulation
p(t|x,w,β) = N( t | y(x,w), β^-1 ),   wML = Φ+ t
•  With log-likelihood
ln p(t | w, β) = (N/2) ln β - (N/2) ln 2π - β ED(w),   where ED(w) = (1/2) Σn=1..N { tn - wTφ(xn) }^2
•  Taking the gradient with respect to β and setting it to zero gives
1/βML = (1/N) Σn=1..N { tn - wMLTφ(xn) }^2
•  Thus the inverse of the noise precision is the residual variance of the target values around the regression function
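Continuing the earlier sketch (assuming NumPy and the Phi, t, w_ml defined above):

    import numpy as np

    residuals = t - Phi @ w_ml
    beta_ml = 1.0 / np.mean(residuals**2)   # 1/beta_ML = residual variance around the fit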
Geometry of Least Squares Solution
•  Consider the N-dimensional space with axes tn, so that t = (t1,..,tN)T is a vector in this space
•  Each basis function φj(xn), corresponding to the jth column of Φ, is represented in this space as a vector ϕj
–  If M < N, then the vectors ϕj span a subspace S of dimension M
•  y is the N-dimensional vector with elements y(xn,w)
•  The sum-of-squares error equals the squared Euclidean distance between y and t
•  The solution w corresponds to the y that lies in subspace S and is closest to t
–  This corresponds to the orthogonal projection of t onto S
(Figure: the design matrix Φ has N rows (data) and M columns (basis functions); its columns ϕ0, ϕ1,.., ϕM-1 are represented as N-dimensional vectors spanning S)
Difficulty of Direct Solution
•  Direct solution of the normal equations
wML = Φ+ t,   Φ+ = (ΦTΦ)^-1 ΦT
–  Can lead to numerical difficulties
•  When two basis functions are (nearly) collinear, ΦTΦ is singular or nearly so (determinant ≈ 0)
–  The parameters can then have large magnitudes
•  Not uncommon with real data sets
•  Can be addressed using
–  Singular Value Decomposition
–  Addition of a regularization term, which ensures the matrix is non-singular
Sequential (On-line) Learning
•  The maximum likelihood solution is
wML = (ΦTΦ)^-1 ΦT t
•  It is a batch technique
–  It processes the entire training set in one go
–  It is computationally expensive for large data sets, due to the huge N x M design matrix Φ
•  The solution is to use a sequential algorithm, where samples are presented one at a time
–  Called stochastic gradient descent
Stochastic Gradient Descent
•  The error function ED(w) = (1/2) Σn=1..N { tn - wTφ(xn) }^2 sums over the data
–  Denoting ED(w) = Σn En, update the parameter vector w using
w(τ+1) = w(τ) - η ∇En
•  Substituting for the derivative:
w(τ+1) = w(τ) + η ( tn - w(τ)Tφ(xn) ) φ(xn)
•  w is initialized to some starting vector w(0)
•  η must be chosen with care to ensure convergence
•  Known as the Least Mean Squares (LMS) algorithm
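A minimal sketch of the LMS updates (assuming NumPy and a design matrix Phi with targets t as above; the learning rate and epoch count are illustrative):

    import numpy as np

    def lms(Phi, t, eta=0.05, epochs=100, seed=0):
        """Sequential least-mean-squares: one update per presented sample."""
        rng = np.random.default_rng(seed)
        N, M = Phi.shape
        w = np.zeros(M)                      # starting vector w(0)
        for _ in range(epochs):
            for n in rng.permutation(N):     # present samples one at a time
                phi_n = Phi[n]
                w = w + eta * (t[n] - w @ phi_n) * phi_n
        return w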
Derivation of Gradient Descent Rule
(Figure: the error surface shows the desirability of every weight vector; it is a parabola with a single global minimum)
Simple Gradient Descent
Procedure Gradient-Descent (
  θ1   // initial starting point
  f    // function to be minimized
  δ    // convergence threshold
)
1  t ← 1
2  do
3     θt+1 ← θt - η ∇f(θt)
4     t ← t + 1
5  while || θt - θt-1 || > δ
6  return θt

Intuition
–  Taylor's expansion of the function f(θ) in the neighborhood of θt is f(θ) ≈ f(θt) + (θ - θt)T ∇f(θt)
–  Let θ = θt+1 = θt + h; thus f(θt+1) ≈ f(θt) + hT ∇f(θt)
–  The derivative of f(θt+1) with respect to h is ∇f(θt)
–  For a step of fixed length, h = ∇f(θt) gives the largest increase and h = -∇f(θt) gives the largest decrease
–  Alternatively: the slope ∇f(θt) points in the direction of steepest ascent; if we take a step η in the opposite direction, we decrease the value of f

One-dimensional example
–  Let f(θ) = θ^2. This function has its minimum at θ = 0, which we want to determine using gradient descent
–  We have f'(θ) = 2θ; for gradient descent, we update in the direction of -f'(θ)
–  If θt > 0, then f'(θt) is positive, and thus θt+1 < θt
–  If θt < 0, then f'(θt) = 2θt is negative, and thus θt+1 > θt
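A minimal runnable version of this procedure (plain Python; the starting point, learning rate η, and threshold δ are illustrative):

    def gradient_descent(grad_f, theta, eta=0.1, delta=1e-6):
        """Repeat theta <- theta - eta * grad_f(theta) until the step is below delta."""
        while True:
            theta_new = theta - eta * grad_f(theta)
            if abs(theta_new - theta) <= delta:
                return theta_new
            theta = theta_new

    # One-dimensional example: f(theta) = theta^2, f'(theta) = 2*theta
    print(gradient_descent(lambda th: 2 * th, theta=3.0))   # converges toward 0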
Difficulties with Simple Gradient Descent
•  Performance depends on choice of learning rate η
•  Large learning rate
–  “Overshoots”
•  Small learning rate
–  Extremely slow
•  Solution
–  Start with large η and settle on optimal value
–  Need a schedule for shrinking η
Improvements to Gradient Descent
Procedure Gradient-Descent (
  θ1   // initial starting point
  f    // function to be minimized
  δ    // convergence threshold
)
1  t ← 1
2  do
3     θt+1 ← θt - η ∇f(θt)
4     t ← t + 1
5  while || θt - θt-1 || > δ
6  return θt

Line Search
–  Adaptively choose η at each step
–  Define a “line” in the direction of the gradient: g(η) = θt + η ∇f(θt)
–  Find three points η1 < η2 < η3 so that f(g(η2)) is smaller than at both of the others
–  Take η' = (η1 + η2)/2
(Figure, Brent's method: the solid line is f(g(η)); three points bracket its maximum; the dashed line shows the quadratic fit)
Conjugate Gradient Ascent
•  In simple gradient ascent:
–  Two consecutive gradients are orthogonal: ∇fobj(θt+1) is orthogonal to ∇fobj(θt)
–  Progress is zig-zag: progress toward the maximum slows down
(Figure: contours of a quadratic objective fobj(x,y) = -(x^2 + 10y^2) and an exponential objective fobj(x,y) = exp[-(x^2 + 10y^2)])
•  Solution:
–  Remember past directions and bias the new direction ht as a combination of the current gradient gt and the past direction ht-1
–  Faster convergence than simple gradient ascent
Steepest Descent
•  Gradient descent search determines a weight vector w that minimizes ED(w) by
–  Starting with an arbitrary initial weight vector
–  Repeatedly modifying it in small steps
–  At each step, modifying the weight vector in the direction that produces the steepest descent along the error surface
Definition of Gradient Vector
•  The gradient (derivative) of E with respect to each component of the vector w:
∇E[w] ≡ [ ∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn ]
•  Notice that ∇E[w] is a vector of partial derivatives
•  It specifies the direction that produces the steepest increase in E
•  The negative of this vector specifies the direction of steepest decrease
Direction of Steepest Descent
(Figure: the negated gradient is the direction in the w0-w1 plane producing the steepest descent)
Gradient Descent Rule
•  w ← w + Δw
–  where Δw = -η ∇E[w]
–  η is a positive constant called the learning rate
•  It determines the step size of the gradient descent search
•  Component form of gradient descent
–  Can also be written as wi ← wi + Δwi, where Δwi = -η ∂E/∂wi
Regularized Least Squares
•  A regularization term is added to the error to control overfitting
–  Error function to minimize:
E(w) = ED(w) + λ EW(w)
•  where λ is the regularization coefficient, which controls the relative importance of the data-dependent error ED(w) and the regularization term EW(w)
–  A simple form of regularizer is EW(w) = (1/2) wTw  (the “quadratic” regularizer)
–  The total error function becomes
E(w) = (1/2) Σn=1..N { tn - wTφ(xn) }^2 + (λ/2) wTw
–  This regularizer is called weight decay
•  because in sequential learning, the weight values decay towards zero unless supported by the data
Quadratic Regularizer
E(w) = (1/2) Σn=1..N { tn - wTφ(xn) }^2 + (λ/2) wTw
•  Advantage: the error function remains a quadratic function of w
•  Its exact minimizer can be found in closed form
•  Setting the gradient to zero and solving for w:
w = (λI + ΦTΦ)^-1 ΦT t
•  This is a simple extension of the least-squares solution
wML = (ΦTΦ)^-1 ΦT t
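A minimal sketch of this closed-form regularized solution (assuming NumPy and the Phi, t from the earlier sketches; the value of λ is illustrative):

    import numpy as np

    lam = 1e-3                       # regularization coefficient lambda
    M = Phi.shape[1]
    w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)   # (lam*I + Phi^T Phi)^-1 Phi^T t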
Generalization of Quadratic Regularizer
•  Regularized error
E(w) = (1/2) Σn=1..N { tn - wTφ(xn) }^2 + (λ/2) Σj=1..M |wj|^q
•  q = 2 corresponds to the quadratic regularizer; q = 1 is known as the lasso
•  Contours of the constant regularization term Σj |wj|^q in the space of (w1, w2):
(Figure: |w1| + |w2| = const for the lasso (q = 1), w1^2 + w2^2 = const for the quadratic regularizer (q = 2), and w1^4 + w2^4 = const for q = 4)
Geometric Interpretation of Regularizer
(Figure: contours of the unregularized error function in the (w1, w2) plane; any value of w on a contour gives the same error)
•  We are trying to find the w that minimizes
Σn=1..N { tn - wTφ(xn) }^2
•  subject to the constraint
Σj=1..M |wj|^q ≤ η
–  We don't want the weights to become too large
•  The two approaches are related by using Lagrange multipliers
Minimization of Unregularized Error subject to Constraint
(Figure, minimization with the quadratic regularizer, q = 2: blue contours of the unregularized error function, the constraint region, and the optimum value w*)
Sparsity with Lasso Constraint
•  With q = 1 and λ sufficiently large, some of the coefficients wj are driven to zero
•  This leads to a sparse model
–  in which the corresponding basis functions play no role
•  The origin of the sparsity is illustrated here:
(Figure: minimization with the quadratic regularizer gives a solution where w1* and w2* are both nonzero; minimization with the lasso regularizer gives a sparse solution with w1* = 0)
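A minimal sketch illustrating this sparsity (assuming scikit-learn is available, and reusing an N x M design matrix Phi and targets t as in the earlier sketches; the alpha value is illustrative):

    from sklearn.linear_model import Lasso

    lasso = Lasso(alpha=0.01, fit_intercept=False)   # q = 1 penalty on the weights
    lasso.fit(Phi, t)
    print(lasso.coef_)   # with a large enough alpha, several coefficients become exactly 0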
Regularization: Conclusion
•  Regularization allows
–  complex models to be trained on small data sets
–  without severe over-fitting
•  It limits the effective model complexity
–  i.e., how many basis functions to use
•  The problem of limiting complexity is shifted to
–  one of determining a suitable value of the regularization coefficient λ
Returning to LeToR Problem
•  Try:
•  Several basis functions
•  Quadratic regularization
•  Express results as ERMS rather than as the squared error E(w*) or as an error rate with thresholded results
ERMS = sqrt( 2E(w*) / N )
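A minimal sketch of this evaluation metric (assuming NumPy and the Phi, t, and fitted weights w_reg from the earlier sketches):

    import numpy as np

    E = 0.5 * np.sum((t - Phi @ w_reg) ** 2)   # squared error E(w*)
    E_rms = np.sqrt(2 * E / len(t))            # root-mean-square error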
Multiple Outputs
•  Several target variables t = (t1,..,tK), where K > 1
•  Can be treated as multiple (K) independent regression problems
–  with different basis functions for each component of t
•  The more common solution is to use the same set of basis functions to model all components of the target vector:
y(x,W) = WTφ(x)
–  where y is a K-dimensional column vector, W is an M x K matrix of weights, and φ(x) is an M-dimensional column vector with elements φj(x)
Solution for Multiple Outputs
•  The set of observations t1,..,tN is combined into a matrix T of size N x K, such that the nth row is given by tnT
•  Combine the input vectors x1,..,xN into a matrix X
•  The log-likelihood function is maximized
•  The solution has the same form: WML = (ΦTΦ)^-1 ΦT T
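A minimal sketch of the multi-output solution (assuming NumPy, the design matrix Phi from the earlier sketches, and an illustrative N x K target matrix T built here for demonstration):

    import numpy as np

    T = np.column_stack([t, t ** 2])   # illustrative N x K target matrix (K = 2)
    W_ml = np.linalg.pinv(Phi) @ T     # W_ML = (Phi^T Phi)^-1 Phi^T T, shape M x K
    Y = Phi @ W_ml                     # predictions y(x_n, W) for all n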