MATH 829: Introduction to Data Mining and Analysis
Introduction to statistical decision theory

Dominique Guillot
Department of Mathematical Sciences
University of Delaware
March 4, 2016

Statistical decision theory

A framework for developing models.

Suppose we want to predict a random variable Y using a random vector X. Let \Pr(X, Y) denote the joint probability distribution of (X, Y).

We want to predict Y using some function g(X).

We have a loss function L(Y, g(X)) to measure how well we are doing; e.g., we used L(Y, g(X)) = (Y - g(X))^2 before, when we worked with continuous random variables.

How do we choose g? Is there an optimal choice?

Statistical decision theory (cont.)

It is natural to minimize the expected prediction error:
EPE(g) = E[L(Y, g(X))] = \int L(y, g(x)) \, \Pr(dx, dy).

For example, if X \in \mathbb{R}^p and Y \in \mathbb{R} have a joint density f : \mathbb{R}^p \times \mathbb{R} \to [0, \infty), then we want to choose g to minimize
\int_{\mathbb{R}^p \times \mathbb{R}} (y - g(x))^2 f(x, y) \, dx \, dy.

Recall the iterated expectations theorem: let Z_1, Z_2 be random variables, and let
h(z_2) = E(Z_1 | Z_2 = z_2) = expected value of Z_1 w.r.t. the conditional distribution of Z_1 given Z_2 = z_2.
We define E(Z_1 | Z_2) = h(Z_2). Then
E(Z_1) = E(E(Z_1 | Z_2)).

Statistical decision theory (cont.)

Suppose L(Y, g(X)) = (Y - g(X))^2.
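To make the expected prediction error concrete, here is a minimal Monte Carlo sketch (not from the slides): it estimates EPE(g) for two candidate prediction rules under an assumed joint distribution, and checks the iterated expectations identity numerically. The choice X ~ N(0, 1), Y = sin(X) + noise is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed joint distribution, for illustration only:
# X ~ N(0, 1) and Y = sin(X) + Gaussian noise, so E(Y | X = x) = sin(x).
n = 200_000
X = rng.normal(size=n)
Y = np.sin(X) + 0.3 * rng.normal(size=n)

def epe(g):
    """Monte Carlo estimate of EPE(g) = E[(Y - g(X))^2]."""
    return np.mean((Y - g(X)) ** 2)

print(epe(lambda x: np.zeros_like(x)))  # predict 0 everywhere
print(epe(np.sin))                      # predict with the conditional mean sin(x)

# Iterated expectations: E(Y) should match E(E(Y | X)) = E(sin(X)).
print(Y.mean(), np.sin(X).mean())
```

The rule using the conditional mean sin(x) gives a visibly smaller estimated EPE, as the derivation on the next slide predicts.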
Using the iterated expectations theorem:
EPE(g) = E\big( E[(Y - g(X))^2 \mid X] \big) = \int E[(Y - g(X))^2 \mid X = x] \cdot f_X(x) \, dx.

Therefore, to minimize EPE(g), it suffices to choose
g(x) := \operatorname{argmin}_{c \in \mathbb{R}} E[(Y - c)^2 \mid X = x].

Expanding:
E[(Y - c)^2 \mid X = x] = E(Y^2 \mid X = x) - 2c \cdot E(Y \mid X = x) + c^2.

The solution is g(x) = E(Y \mid X = x).
Best prediction: the average of Y given X = x.

Other loss functions

We saw that g(x) := \operatorname{argmin}_{c \in \mathbb{R}} E[(Y - c)^2 \mid X = x] = E(Y \mid X = x).

Suppose instead we work with L(Y, g(X)) = |Y - g(X)|. Applying the same argument, we obtain
g(x) = \operatorname{argmin}_{c \in \mathbb{R}} E[|Y - c| \mid X = x].

Problem: if X has density f_X, what is the minimum of E(|X - c|) over c?
E(|X - c|) = \int |x - c| f_X(x) \, dx = \int_{-\infty}^{c} (c - x) f_X(x) \, dx + \int_{c}^{\infty} (x - c) f_X(x) \, dx.

Now, differentiate:
\frac{d}{dc} E(|X - c|) = \frac{d}{dc} \int_{-\infty}^{c} (c - x) f_X(x) \, dx + \frac{d}{dc} \int_{c}^{\infty} (x - c) f_X(x) \, dx.

Other loss functions (cont.)

Recall: \frac{d}{dx} \int_{a}^{x} h(t) \, dt = h(x).

Here, we have
\frac{d}{dc} \left( c \int_{-\infty}^{c} f_X(x) \, dx - \int_{-\infty}^{c} x f_X(x) \, dx \right) + \frac{d}{dc} \left( \int_{c}^{\infty} x f_X(x) \, dx - c \int_{c}^{\infty} f_X(x) \, dx \right)
= \int_{-\infty}^{c} f_X(x) \, dx - \int_{c}^{\infty} f_X(x) \, dx.
Check! (Use the product rule and \int_{c}^{\infty} = \int_{-\infty}^{\infty} - \int_{-\infty}^{c}.)

Conclusion: \frac{d}{dc} E(|X - c|) = 0 iff c is such that F_X(c) = 1/2. So the minimum is obtained when c = \operatorname{median}(X).

Going back to our problem:
g(x) = \operatorname{argmin}_{c \in \mathbb{R}} E[|Y - c| \mid X = x] = \operatorname{median}(Y \mid X = x).
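As a quick numerical sanity check (not part of the original slides), the sketch below approximates E[(Y - c)^2] and E[|Y - c|] over a grid of values of c for a skewed sample, and compares the minimizers with the sample mean and median. The exponential sample is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed sample, for illustration: a skewed distribution so mean != median.
y = rng.exponential(scale=2.0, size=100_000)

grid = np.linspace(0.0, 6.0, 601)
sq_risk  = [np.mean((y - c) ** 2) for c in grid]   # estimated E[(Y - c)^2]
abs_risk = [np.mean(np.abs(y - c)) for c in grid]  # estimated E[|Y - c|]

print("argmin squared loss :", grid[np.argmin(sq_risk)],  "sample mean  :", y.mean())
print("argmin absolute loss:", grid[np.argmin(abs_risk)], "sample median:", np.median(y))
```

Up to the grid resolution, the squared-loss minimizer lands on the sample mean and the absolute-loss minimizer on the sample median, matching the two derivations above.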
Back to nearest neighbors

We saw that E(Y | X = x) minimizes the expected loss when the loss is the squared error.

In practice, we don't know the joint distribution of X and Y.

Nearest neighbors can be seen as an attempt to approximate E(Y | X = x) by
1. approximating the expected value by averaging sample data;
2. replacing "| X = x" by "| X \approx x" (since there are generally no, or only a few, samples where X = x).

There is thus strong theoretical motivation for working with nearest neighbors.

Note: If one is interested in controlling the absolute error, then one could compute the median of the neighbors instead of the mean.
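The following sketch (an illustration, not code from the course) implements the nearest-neighbor approximation described above: it estimates E(Y | X = x0) by averaging the responses of the k training points closest to x0, with an option to take their median instead when the absolute error is the target. The function name knn_predict and the synthetic data are assumptions made for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=5, use_median=False):
    """Approximate E(Y | X = x0) (or the conditional median) using the
    k training points whose X values are closest to x0."""
    dists = np.linalg.norm(X_train - x0, axis=1)  # distances from each training point to x0
    idx = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
    return np.median(y_train[idx]) if use_median else np.mean(y_train[idx])

# Hypothetical usage on synthetic data.
rng = np.random.default_rng(2)
X_train = rng.uniform(-3, 3, size=(500, 1))
y_train = np.sin(X_train[:, 0]) + 0.3 * rng.normal(size=500)

x0 = np.array([1.0])
print(knn_predict(X_train, y_train, x0, k=15))                   # squared-error target: neighbor mean
print(knn_predict(X_train, y_train, x0, k=15, use_median=True))  # absolute-error target: neighbor median
```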