MATH 829: Introduction to Data Mining and Analysis

Introduction to statistical decision theory
Dominique Guillot
Department of Mathematical Sciences
University of Delaware
March 4, 2016

Statistical decision theory

A framework for developing models. Suppose we want to predict a random variable Y using a random vector X.

Let Pr(X, Y) denote the joint probability distribution of (X, Y).

We want to predict Y using some function g(X).

We have a loss function L(Y, g(X)) to measure how well we are doing, e.g., when we worked with continuous random variables we used

L(Y, g(X)) = (Y - g(X))^2.

How do we choose g? What is the optimal choice?

Statistical decision theory (cont.)

It is natural to minimize the expected prediction error:

EPE(g) = E[L(Y, g(X))] = ∫ L(y, g(x)) Pr(dx, dy).

For example, if X ∈ R^p and Y ∈ R have a joint density f : R^p × R → [0, ∞), then we want to choose g to minimize

∫_{R^p × R} (y - g(x))^2 f(x, y) dx dy.

Recall the iterated expectations theorem: let Z_1, Z_2 be random variables, and let h(z_2) = E(Z_1 | Z_2 = z_2) denote the expected value of Z_1 w.r.t. the conditional distribution of Z_1 given Z_2 = z_2. We define E(Z_1 | Z_2) = h(Z_2). Then:

E(Z_1) = E(E(Z_1 | Z_2)).
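
The iterated expectations theorem can be checked numerically. Below is a small Monte Carlo sketch (illustrative only; the model with Z_2 ~ Uniform(0, 1) and Z_1 | Z_2 = z_2 ~ N(z_2, 1) is a hypothetical choice, not from the lecture) verifying that E(Z_1) = E(E(Z_1 | Z_2)).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# Hypothetical model: Z2 ~ Uniform(0, 1), and Z1 | Z2 = z2 ~ N(z2, 1).
z2 = rng.uniform(0.0, 1.0, size=n)
z1 = rng.normal(loc=z2, scale=1.0)

# Left-hand side: direct Monte Carlo estimate of E(Z1).
lhs = z1.mean()

# Right-hand side: here h(z2) = E(Z1 | Z2 = z2) = z2, so E(E(Z1 | Z2)) = E(Z2).
rhs = z2.mean()

print(lhs, rhs)  # both estimates are close to 1/2
```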

Statistical decision theory (cont.)

Suppose L(Y, g(X)) = (Y - g(X))^2. Using the iterated expectations theorem:

EPE(g) = E[ E[(Y - g(X))^2 | X] ] = ∫ E[(Y - g(X))^2 | X = x] · f_X(x) dx.

Therefore, to minimize EPE(g), it suffices to choose

g(x) := argmin_{c ∈ R} E[(Y - c)^2 | X = x].

Expanding:

E[(Y - c)^2 | X = x] = E(Y^2 | X = x) - 2c · E(Y | X = x) + c^2.

The solution is

g(x) = E(Y | X = x).

Best prediction: the average of Y given X = x.
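
This can also be seen empirically: for a fixed x, the constant c minimizing the empirical expected squared loss is the sample mean. The sketch below uses a hypothetical conditional distribution Y | X = x ~ N(2.5, 1); the numbers are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conditional distribution: Y | X = x ~ N(m(x), 1) with m(x) = 2.5.
m_x = 2.5
y = rng.normal(loc=m_x, scale=1.0, size=100_000)

# Empirical expected squared loss E[(Y - c)^2 | X = x] on a grid of constants c.
grid = np.linspace(0.0, 5.0, 501)
mse = np.array([np.mean((y - c) ** 2) for c in grid])

best_c = grid[np.argmin(mse)]
print("conditional mean (target):", m_x)
print("empirical minimizer of E[(Y - c)^2 | X = x]:", best_c)  # close to 2.5
print("sample mean of Y:", y.mean())                           # also close to 2.5
```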

Other loss functions

We saw that g(x) := argmin_{c ∈ R} E[(Y - c)^2 | X = x] = E(Y | X = x).

Suppose instead we work with L(Y, g(X)) = |Y - g(X)|.

Applying the same argument, we obtain

g(x) = argmin_{c ∈ R} E[|Y - c| | X = x].

Problem: If X has density f_X, what is the minimum of E(|X - c|) over c?

E(|X - c|) = ∫ |x - c| f_X(x) dx
           = ∫_{-∞}^{c} (c - x) f_X(x) dx + ∫_{c}^{∞} (x - c) f_X(x) dx.

Now, differentiate:

d/dc E(|X - c|) = d/dc ∫_{-∞}^{c} (c - x) f_X(x) dx + d/dc ∫_{c}^{∞} (x - c) f_X(x) dx.

Other loss functions (cont.)

Recall: d/dx ∫_{a}^{x} h(t) dt = h(x).

Here, we have

d/dc [ c ∫_{-∞}^{c} f_X(x) dx - ∫_{-∞}^{c} x f_X(x) dx ] + d/dc [ ∫_{c}^{∞} x f_X(x) dx - c ∫_{c}^{∞} f_X(x) dx ]
    = ∫_{-∞}^{c} f_X(x) dx - ∫_{c}^{∞} f_X(x) dx.

Check! (Use the product rule and ∫_{c}^{∞} = ∫_{-∞}^{∞} - ∫_{-∞}^{c}.)

Conclusion: d/dc E(|X - c|) = 0 iff c is such that F_X(c) = 1/2, i.e., the minimum of E(|X - c|) is obtained when c = median(X).

Going back to our problem:

g(x) = argmin_{c ∈ R} E[|Y - c| | X = x] = median(Y | X = x).
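
A quick numerical sketch of the last two slides (illustrative only; the exponential distribution and the grid are hypothetical choices, not from the lecture): for a skewed distribution, the minimizer of E(|X - c|) is the median of X, not its mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Skewed hypothetical distribution: X ~ Exponential(1), so mean = 1, median = ln 2.
x = rng.exponential(scale=1.0, size=200_000)

# Empirical mean absolute deviation E|X - c| on a grid of constants c.
grid = np.linspace(0.0, 3.0, 601)
mad = np.array([np.mean(np.abs(x - c)) for c in grid])

best_c = grid[np.argmin(mad)]
print("minimizer of E|X - c|:", best_c)   # close to ln 2 ≈ 0.693
print("median of X:", np.median(x))       # also close to ln 2
print("mean of X:", x.mean())             # about 1.0, not the minimizer
```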

Back to nearest neighbors

We saw that E(Y | X = x) minimizes the expected loss when the loss is the squared error.

In practice, we don't know the joint distribution of X and Y.

Nearest neighbors can be seen as an attempt to approximate E(Y | X = x) by:

1. Approximating the expected value by averaging sample data.
2. Replacing |X = x by |X ≈ x (since there are generally no, or only a few, samples where X = x).

There is thus strong theoretical motivation for working with nearest neighbors.

Note: If one is interested in controlling the absolute error, one could compute the median of the neighbors instead of the mean (see the sketch below).
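
The following is a minimal sketch of this idea (the data-generating model Y = sin(2πX) + noise, the sample size, and k = 15 are hypothetical choices for illustration): E(Y | X = x) is approximated by averaging the responses of the k training points closest to x, and median(Y | X = x) by taking their median instead.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical training data: Y = sin(2*pi*X) + noise.
n, k = 500, 15
X_train = rng.uniform(0.0, 1.0, size=n)
y_train = np.sin(2 * np.pi * X_train) + rng.normal(scale=0.3, size=n)

def knn_predict(x0, reduce=np.mean):
    """Predict at x0 from the k nearest training points; reduce=np.mean targets
    the squared-error loss, reduce=np.median targets the absolute loss."""
    idx = np.argsort(np.abs(X_train - x0))[:k]  # indices of the k nearest neighbors
    return reduce(y_train[idx])

x0 = 0.25
print("kNN mean estimate of E(Y | X ≈ 0.25):", knn_predict(x0))
print("kNN median estimate of median(Y | X ≈ 0.25):", knn_predict(x0, np.median))
print("true conditional mean: sin(pi/2) = 1.0")
```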