Lecture Notes 15
Prediction
Chapters 13, 22, 20.4.
1 Introduction
Prediction is covered in detail in 36-707, 36-701, 36-715, 10/36-702. Here, we will just give
an introduction.
We observe training data Z1, . . . , Zn ∼ P where Zi = (Xi, Yi) and Xi ∈ R^d. Given a
new pair Z = (X, Y) we want to predict Y from X. There are two common versions:
1. Y ∈ {0, 1}. This is called classification, or discrimination, or pattern recognition.
(More generally, Y can be discrete.)
2. Y ∈ R. This is called regression.
For classification we will use the following loss function. Let h(x) be our prediction of Y
when X = x. Thus h(x) ∈ {0, 1}. The function h is called a classifier. The classification
loss is I(Y ≠ h(X)) and the classification risk is

R(h) = P(Y ≠ h(X)) = E(I(Y ≠ h(X))).
For regression, suppose our prediction of Y when X = x is g(x). We will use the squared
error prediction loss (Y − g(X))² and the risk is

R(g) = E(Y − g(X))².
Notation: We write Xi = (Xi(1), . . . , Xi(d)). Hence, Xi(j) is the jth feature for the ith
observation.
2 The Optimal Regression Function
Suppose for the moment that we know the joint distribution p(x, y). Then we can find the
best regression function.
Theorem 1 R(g) is minimized by

m(x) = E(Y | X = x) = ∫ y p(y|x) dy.
Proof. Let g(x) be any function of x. Then
R(g) = E(Y − g(X))² = E(Y − m(X) + m(X) − g(X))²
     = E(Y − m(X))² + E(m(X) − g(X))² + 2 E[(Y − m(X))(m(X) − g(X))]
     ≥ E(Y − m(X))² + 2 E[(Y − m(X))(m(X) − g(X))]
     = E(Y − m(X))² + 2 E[ E((Y − m(X))(m(X) − g(X)) | X) ]
     = E(Y − m(X))² + 2 E[(E(Y|X) − m(X))(m(X) − g(X))]
     = E(Y − m(X))² + 2 E[(m(X) − m(X))(m(X) − g(X))]
     = E(Y − m(X))² = R(m).
Of course, we do not know m(x) so we need to find a way to predict Y based on the
training data.
3 Linear Regression
The simplest approach is to use a parametric model. In particular, the linear regression
model assumes that m(x) is a linear function of x = (x(1), . . . , x(d)). That is, we use a
predictor of the form m(x) = β0 + Σ_j β(j) x(j). If we define x(1) = 1 then we can write this
more simply as m(x) = β^T x. In what follows, I always assume that the intercept has been
absorbed this way.
3.1 A Bad Approach: Assume the True Model is Linear
One approach is to assume that the true regression function m(x) is linear. Hence,
m(x) = β^T x and we can then write

Yi = β^T Xi + εi

where E[εi] = 0. This model is certainly wrong, so let's proceed with caution.
The least squares estimator β̂ is defined to be the β that minimizes

Σ_{i=1}^n (Yi − Xi^T β)².
Theorem 2 Let X be the n × d matrix with X(i, j) = Xi(j) and let Y = (Y1, . . . , Yn). Suppose
that X^T X is invertible. Then the least squares estimator is

β̂ = (X^T X)^{-1} X^T Y.
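As an illustration, here is a minimal numpy sketch of this estimator on simulated data. The simulated design, the coefficients and all variable names below are my own choices, not part of the notes.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 3

    # Simulated design matrix; the first column is the constant 1 (absorbed intercept).
    X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
    beta_true = np.array([1.0, 2.0, -0.5])
    Y = X @ beta_true + rng.normal(size=n)

    # Least squares estimator: beta_hat = (X^T X)^{-1} X^T Y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

    # np.linalg.lstsq computes the same quantity in a more numerically stable way.
    beta_hat_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

    print(beta_hat)        # close to beta_true
    print(beta_hat_lstsq)  # same estimate

In practice one prefers the factorization-based solver over forming (X^T X)^{-1} explicitly, but both compute the estimator of Theorem 2.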
Theorem 3 Suppose that the linear model is correct. Also, suppose that Var(εi) = σ² and
that Xi is fixed. Then β̂ is unbiased and has covariance σ² (X^T X)^{-1}. Under some regularity
conditions, β̂ is asymptotically Normally distributed. If the εi's are N(0, σ²) then β̂ has a
Normal distribution.
Continuing with the assumption that the linear model is correct, we can also say the
following. A consistent estimator of σ² is

σ̂² = RSS / (n − d)

where RSS = Σ_i (Yi − Xi^T β̂)² is the residual sum of squares, and

(β̂j − βj) / sj ⇝ N(0, 1)

where the standard error sj is the square root of the jth diagonal element of σ̂² (X^T X)^{-1}. To test H0 : βj = 0 versus
H1 : βj ≠ 0 we reject if |β̂j|/sj > z_{α/2}. An approximate 1 − α confidence interval for βj is
β̂j ± z_{α/2} sj.
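A minimal sketch of these standard errors, Wald tests and confidence intervals, under the assumption that the linear model is correct; the simulated data and variable names are illustrative choices of mine:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 200, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
    beta_true = np.array([1.0, 0.0, -0.5])
    Y = X @ beta_true + rng.normal(scale=2.0, size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    resid = Y - X @ beta_hat

    sigma2_hat = resid @ resid / (n - d)         # RSS / (n - d)
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))  # standard errors s_j

    wald = beta_hat / se                         # test statistics for H0: beta_j = 0
    z_crit = 1.96                                # z_{alpha/2} for alpha = 0.05
    ci = np.column_stack([beta_hat - z_crit * se, beta_hat + z_crit * se])

    print(np.column_stack([beta_hat, se, wald]))
    print(ci)                                    # approximate 95% confidence intervals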
Theorem 4 Suppose that the linear model is correct and that ε1, . . . , εn ∼ N(0, σ²). Then
the least squares estimator is the maximum likelihood estimator.
3.2 A Better Approach: Assume the True Model is Not Linear
Now we switch to more reasonable assumptions. We assume that the linear model is wrong
and that X is random. The least squares estimator still has good properties. Let β∗ minimize
R(β) = E(Y − X^T β)².

We call ℓ∗(x) = x^T β∗ the best linear predictor. It is also called the projection parameter.
Lemma 5 The value of β that minimizes R(β) is

β = Λ^{-1} α

where Λ = E[Xi Xi^T] is a d × d matrix and α = (α(1), . . . , α(d)) where α(j) = E[Yi Xi(j)].
The plug-in estimator is the least squares estimator

β̂ = Λ̂^{-1} α̂ = (X^T X)^{-1} X^T Y

where

Λ̂ = (1/n) Σ_i Xi Xi^T   and   α̂ = (1/n) Σ_i Yi Xi.
In other words, the least-squares estimator is the plug-in estimator. We can write β =
g(Λ, α). By the law of large numbers, Λ̂ →_P Λ and α̂ →_P α. If Λ is invertible, then g is
continuous and so, by the continuous mapping theorem,

β̂ →_P β.

(This all assumes d is fixed. If d increases with n then we need different theory that is
discussed in 10/36-702 and 36-707.) By the delta-method,

√n (β̂ − β) ⇝ N(0, Γ)
for some Γ. There is a convenient, consistent estimator of Γ, called the sandwich estimator,
given by

Γ̂ = Λ̂^{-1} M Λ̂^{-1}

where

M = (1/n) Σ_{i=1}^n ri² Xi Xi^T

and ri = Yi − Xi^T β̂. Hence, an asymptotic confidence interval for β(j) is

β̂(j) ± (z_{α/2}/√n) √(Γ̂(j, j)).
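A sketch of the sandwich interval in numpy; the data generating process below is an arbitrary nonlinear, heteroskedastic example of my own, chosen to emphasize that the linear model need not be correct:

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 500, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
    # Nonlinear truth with heteroskedastic noise: the linear model is wrong on purpose.
    Y = np.sin(X[:, 1]) + 0.5 * X[:, 2] ** 2 + rng.normal(scale=1 + np.abs(X[:, 1]), size=n)

    Lambda_hat = X.T @ X / n                  # (1/n) sum_i X_i X_i^T
    alpha_hat = X.T @ Y / n                   # (1/n) sum_i Y_i X_i
    beta_hat = np.linalg.solve(Lambda_hat, alpha_hat)

    r = Y - X @ beta_hat                      # residuals r_i
    M = (X * r[:, None] ** 2).T @ X / n       # (1/n) sum_i r_i^2 X_i X_i^T
    Lam_inv = np.linalg.inv(Lambda_hat)
    Gamma_hat = Lam_inv @ M @ Lam_inv         # sandwich estimator

    z = 1.96                                  # z_{alpha/2} for alpha = 0.05
    se = np.sqrt(np.diag(Gamma_hat) / n)
    print(np.column_stack([beta_hat - z * se, beta_hat + z * se]))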
Another way to construct a confidence set is to use the bootstrap. In particular, we use
the pairs bootstrap which treats each pair (Xi , Yi ) as one observation. The confidence set is
Cn = { β : ||β − β̂||_∞ ≤ t_α/√n }

where t_α is defined by

P( √n ||β̂∗ − β̂||_∞ > t_α | Z1, . . . , Zn ) = α

where Zi = (Xi, Yi). In practice, we approximate this with

P( √n ||β̂∗ − β̂||_∞ > t_α | Z1, . . . , Zn ) ≈ (1/B) Σ_{j=1}^B I( √n ||β̂∗_j − β̂||_∞ > t_α ).

There is another version of the bootstrap which bootstraps the residuals ε̂i = Yi − Xi^T β̂.
However, this version is only valid if the linear model is correct.
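A sketch of the pairs bootstrap for t_α, assuming simulated data; the function name, the number of bootstrap replications B and the example data are mine, not from the notes:

    import numpy as np

    def pairs_bootstrap_talpha(X, Y, B=2000, alpha=0.05, seed=0):
        """Estimate t_alpha for the sup-norm confidence set by the pairs bootstrap."""
        rng = np.random.default_rng(seed)
        n = len(Y)
        beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
        stats = np.empty(B)
        for b in range(B):
            idx = rng.integers(0, n, size=n)            # resample the pairs (X_i, Y_i)
            beta_star = np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0]
            stats[b] = np.sqrt(n) * np.max(np.abs(beta_star - beta_hat))
        return beta_hat, np.quantile(stats, 1 - alpha)  # t_alpha: the (1 - alpha) quantile

    # Example usage with simulated data (illustrative only):
    rng = np.random.default_rng(3)
    n = 300
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = np.cos(X[:, 1]) + rng.normal(size=n)
    beta_hat, t_alpha = pairs_bootstrap_talpha(X, Y)
    print(beta_hat, t_alpha / np.sqrt(n))  # C_n is the box beta_hat +/- t_alpha / sqrt(n)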
3.3 The Geometry of Least Squares
The fitted values or predicted values are Ŷ = (Ŷ1, . . . , Ŷn)^T where

Ŷi = Xi^T β̂.

Hence,

Ŷ = X β̂ = HY

where

H = X(X^T X)^{-1} X^T

is called the hat matrix.
Theorem 6 The matrix H is symmetric and idempotent: H² = H. Moreover, HY is the
projection of Y onto the column space of X.
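A quick numerical check of Theorem 6, using a small simulated design of my own choosing:

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(20, 3))
    H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix

    print(np.allclose(H, H.T))              # symmetric
    print(np.allclose(H @ H, H))            # idempotent: H^2 = H

    Y = rng.normal(size=20)
    # HY is the projection of Y onto the column space of X:
    # the residual Y - HY is orthogonal to every column of X.
    print(np.allclose(X.T @ (Y - H @ Y), 0))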
4 Nonparametric Regression
Suppose we want to estimate m(x) where we only assume that m is a smooth function. The
kernel regression estimator is
m̂(x) = Σ_i Yi wi(x)

where

wi(x) = K(||x − Xi||/h) / Σ_j K(||x − Xj||/h).

Here K is a kernel and h is a bandwidth. The properties of m̂ are similar to those of the
kernel density estimator and are discussed in more detail in 36-707 and in 10-702. An
example is shown in Figure 1.
Figure 1: A kernel regression estimator.
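A minimal sketch of the kernel regression estimator with a Gaussian kernel; the bandwidth, the simulated data and the function name are illustrative choices of mine, not taken from the notes:

    import numpy as np

    def kernel_regression(x0, X, Y, h):
        """Kernel regression estimate m_hat(x0) with a Gaussian kernel and bandwidth h."""
        dist = np.abs(X - x0)                  # ||x - X_i|| for one-dimensional X
        w = np.exp(-0.5 * (dist / h) ** 2)     # K(||x - X_i|| / h)
        w = w / w.sum()                        # normalized weights w_i(x)
        return np.sum(w * Y)

    # Illustrative data: a smooth trend plus noise (my own simulation).
    rng = np.random.default_rng(5)
    n = 100
    X = rng.uniform(-1, 1, size=n)
    Y = np.sin(3 * X) + 0.3 * rng.normal(size=n)

    grid = np.linspace(-1, 1, 50)
    m_hat = np.array([kernel_regression(x0, X, Y, h=0.1) for x0 in grid])
    print(m_hat[:5])

The bandwidth h controls the bias-variance tradeoff, just as for kernel density estimation.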
5 Classification
The best classifier is the so-called Bayes classifier defined by:
hB (x) = I(m(x) ≥ 1/2)
where m(x) = E(Y |X = x). (This has nothing to do with Bayesian inference.)
Theorem 7 For any h, R(h) ≥ R(hB ).
Proof. For any h,

R(h) − R(hB) = P(Y ≠ h(X)) − P(Y ≠ hB(X))
             = ∫ P(Y ≠ h(x)|X = x) p(x) dx − ∫ P(Y ≠ hB(x)|X = x) p(x) dx
             = ∫ [P(Y ≠ h(x)|X = x) − P(Y ≠ hB(x)|X = x)] p(x) dx.

We will show that

P(Y ≠ h(x)|X = x) − P(Y ≠ hB(x)|X = x) ≥ 0

for all x. Now

P(Y ≠ h(x)|X = x) − P(Y ≠ hB(x)|X = x)
  = [h(x) P(Y ≠ 1|X = x) + (1 − h(x)) P(Y ≠ 0|X = x)]
    − [hB(x) P(Y ≠ 1|X = x) + (1 − hB(x)) P(Y ≠ 0|X = x)]
  = [h(x)(1 − m(x)) + (1 − h(x)) m(x)] − [hB(x)(1 − m(x)) + (1 − hB(x)) m(x)]
  = 2(m(x) − 1/2)(hB(x) − h(x)) ≥ 0
since hB(x) = 1 if and only if m(x) ≥ 1/2.

The most direct approach to classification is empirical risk minimization (ERM). We
start with a set of classifiers H. Each h ∈ H is a function h : x → {0, 1}. The training error
or empirical risk is

R̂(h) = (1/n) Σ_{i=1}^n I(Yi ≠ h(Xi)).
We choose ĥ to minimize R̂:

ĥ = argmin_{h∈H} R̂(h).
For example, a linear classifier has the form hβ(x) = I(β^T x ≥ 0). The set of linear
classifiers is H = {hβ : β ∈ R^d}.
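For concreteness, here is how the empirical risk of a linear classifier could be computed; the simulated data and names are illustrative assumptions of mine:

    import numpy as np

    def empirical_risk(beta, X, Y):
        """Training error (1/n) sum_i I(Y_i != h_beta(X_i)) for h_beta(x) = I(beta^T x >= 0)."""
        h = (X @ beta >= 0).astype(int)
        return np.mean(h != Y)

    rng = np.random.default_rng(6)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    Y = (X[:, 1] + X[:, 2] + 0.5 * rng.normal(size=n) > 0).astype(int)

    print(empirical_risk(np.array([0.0, 1.0, 1.0]), X, Y))   # a good direction
    print(empirical_risk(np.array([0.0, -1.0, 1.0]), X, Y))  # a poor direction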
Theorem 8 Suppose that H has VC dimension d < ∞. Let ĥ be the empirical risk minimizer and let

h∗ = argmin_{h∈H} R(h)

be the best classifier in H. Then, for any ε > 0,

P(R(ĥ) > R(h∗) + 2ε) ≤ c1 n^d e^{−c2 n ε²}

for some constants c1 and c2.
Proof. Recall that

P( sup_{h∈H} |R̂(h) − R(h)| > ε ) ≤ c1 n^d e^{−c2 n ε²}.

But when sup_{h∈H} |R̂(h) − R(h)| ≤ ε we have

R(ĥ) ≤ R̂(ĥ) + ε ≤ R̂(h∗) + ε ≤ R(h∗) + 2ε.
Empirical risk minimization is difficult because R̂(h) is not a smooth function. Thus,
we often use other approaches. One idea is to use a surrogate loss function. To explain this
idea, it will be convenient to relabel the Yi's as being +1 or −1. Many classifiers then take
the form

h(x) = sign(f(x))

for some f(x). For example, linear classifiers have f(x) = x^T β. The classification loss is then

L(Y, f, X) = I(Y f(X) < 0)

since an error occurs if and only if Y and f(X) have different signs. An example of a surrogate
loss is the hinge loss

(1 − Y f(X))_+.

Instead of minimizing the classification loss, we minimize

Σ_i (1 − Yi f(Xi))_+.

The resulting classifier is called a support vector machine.
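A minimal subgradient-descent sketch for the hinge criterion above. Note that support vector machines as usually implemented also add a regularization penalty, which I omit here to match the display above; the data, learning rate and names are my own illustrative choices.

    import numpy as np

    def hinge_fit(X, Y, lr=0.01, n_iter=2000):
        """Minimize sum_i (1 - Y_i * X_i^T beta)_+ by subgradient descent (Y_i in {-1, +1})."""
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            margins = Y * (X @ beta)
            # Subgradient of the hinge loss: -Y_i X_i where the margin is below 1, else 0.
            grad = -(X * Y[:, None])[margins < 1].sum(axis=0)
            beta -= lr * grad / len(Y)
        return beta

    rng = np.random.default_rng(7)
    n = 300
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    Y = np.where(X[:, 1] - X[:, 2] + 0.3 * rng.normal(size=n) > 0, 1, -1)

    beta_hat = hinge_fit(X, Y)
    print(np.mean(np.sign(X @ beta_hat) != Y))   # training classification error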
Another approach to classification is plug-in classification. We replace the Bayes rule
hB(x) = I(m(x) ≥ 1/2) with

ĥ(x) = I(m̂(x) ≥ 1/2)

where m̂ is an estimate of the regression function. The estimate m̂ can be parametric or
nonparametric.
A common parametric estimator is logistic regression. Here, we assume that

m(x; β) = e^{x^T β} / (1 + e^{x^T β}).

Since Yi is Bernoulli, the likelihood is

L(β) = Π_{i=1}^n m(Xi; β)^{Yi} (1 − m(Xi; β))^{1−Yi}.

We compute the MLE β̂ numerically. See Section 12.3 of the text.
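A sketch of computing the MLE numerically by gradient ascent on the log-likelihood, together with the resulting plug-in classifier; the simulated data, the step size and the names are illustrative assumptions of mine (statistical packages typically use Newton/IRLS iterations instead):

    import numpy as np

    def logistic_mle(X, Y, lr=0.1, n_iter=5000):
        """Maximize the Bernoulli log-likelihood by gradient ascent."""
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            p = 1 / (1 + np.exp(-(X @ beta)))   # m(X_i; beta)
            grad = X.T @ (Y - p)                # gradient of the log-likelihood
            beta += lr * grad / len(Y)
        return beta

    rng = np.random.default_rng(8)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    beta_true = np.array([-0.5, 1.0, 2.0])
    Y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true))))

    beta_hat = logistic_mle(X, Y)
    m_hat = 1 / (1 + np.exp(-(X @ beta_hat)))
    h_hat = (m_hat >= 0.5).astype(int)          # plug-in classifier I(m_hat >= 1/2)
    print(beta_hat, np.mean(h_hat != Y))        # estimate and training error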
What is the relationship between classification and regression? Generally speaking, classification is easier. This follows from the next result.
Theorem 9 Let m(x) = E(Y |X = x) and let hm (x) = I(m(x) ≥ 1/2) be the Bayes rule.
Let g be any function and let hg (x) = I(g(x) ≥ 1/2). Then
R(hg) − R(hm) ≤ 2 √( ∫ |g(x) − m(x)|² dP(x) ).
Proof. We showed earlier that

R(hg) − R(hm) = ∫ [P(Y ≠ hg(x)|X = x) − P(Y ≠ hm(x)|X = x)] dP(x)

and that

P(Y ≠ hg(x)|X = x) − P(Y ≠ hm(x)|X = x) = 2(m(x) − 1/2)(hm(x) − hg(x)).

Now

2(m(x) − 1/2)(hm(x) − hg(x)) = 2|m(x) − 1/2| I(hm(x) ≠ hg(x)) ≤ 2|m(x) − g(x)|

since hm(x) ≠ hg(x) implies that |m(x) − 1/2| ≤ |m(x) − g(x)|. Hence,

R(hg) − R(hm) = 2 ∫ |m(x) − 1/2| I(hm(x) ≠ hg(x)) dP(x)
              ≤ 2 ∫ |m(x) − g(x)| dP(x)
              ≤ 2 √( ∫ |g(x) − m(x)|² dP(x) )

where the last step follows from the Cauchy-Schwarz inequality.

Hence, if we have an estimator m̂ such that ∫ |m̂(x) − m(x)|² dP(x) is small, then the
excess classification risk is also small. But the reverse is not true.