
Chapter 1: Introduction
Chapter 2: Overview of Supervised Learning
2006.01.20
Supervised learning

- Training data set: several features and an outcome
- Build a learner based on the training data set
- Predict the unseen outcome of new data from its observed features
An example of supervised learning

- Email spam: a learner is built from emails known to be spam or normal, then labels new, unknown emails as spam or normal.
[Diagram: known spam and normal emails → learner → spam/normal labels for new emails]
Input & Output

- Input = predictor = independent variable
- Output = response = dependent variable
Output Types

- Quantitative >> regression
  - Ex) stock price, temperature, age
- Qualitative >> classification
  - Ex) Yes/No, spam/normal
Input Types

- Quantitative
- Qualitative
- Ordered categorical
  - Ex) small, medium, big
Terminology

- $X$ : input; $X_j$ : the $j$-th component of $X$
- $\mathbf{X}$ : matrix of inputs; $x_j$ : the $j$-th observed value
- $Y$ : quantitative output; $\hat Y$ : prediction
- $G$ : qualitative output
General model

- Given input $X$ and output $Y$, assume $Y = f(X) + \varepsilon$, where the function $f$ is unknown
- Want to estimate $f$ based on a known data set (the training data)
Two simple methods

- Linear model (linear regression)
- Nearest neighbor method
Linear model

- Given a vector of input features $X = (X_1, \dots, X_p)$
- Assume the linear relationship: $\hat Y = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j$
- Least squares criterion: $\min_{\beta} \sum_{i=1}^{N} (y_i - x_i^T \beta)^2$ (a fitting sketch follows below)
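A minimal fitting sketch of this criterion in NumPy; the synthetic data, the coefficient values, and the seed are illustrative choices, not from the slides:

```python
import numpy as np

# Illustrative data: N samples, p features (sizes and coefficients made up)
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ beta_true + 0.1 * rng.normal(size=N)

# Add an intercept column and solve min_beta ||y - X beta||^2
Xb = np.hstack([np.ones((N, 1)), X])
beta_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)

y_hat = Xb @ beta_hat              # predictions Y-hat
rss = np.sum((y - y_hat) ** 2)     # residual sum of squares
print(beta_hat)                    # close to [1.0, 2.0, -1.0, 0.5]
print(rss)
```

np.linalg.lstsq solves the least squares problem via a stable decomposition, which is preferable to forming the normal-equation inverse $(X^T X)^{-1}$ explicitly.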
Classification example in two dimensions (1)
[Figure: two-class training data classified by the linear model]
Nearest neighbor method

- Majority vote within the k nearest neighbors (a sketch follows below)
- Ex) classifying a new point: k = 1 → brown; k = 3 → green
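A minimal sketch of the majority vote, assuming Euclidean distance; the two-class points are made up and arranged so that k = 1 and k = 3 disagree, as in the example above:

```python
import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, g_train, k):
    """Majority vote among the k training points nearest to x_new."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of k closest
    return Counter(g_train[i] for i in nearest).most_common(1)[0][0]

# Made-up two-class training points
X_train = np.array([[0.75, 0.80], [0.0, 0.0],             # brown
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])  # green
g_train = np.array(["brown", "brown", "green", "green", "green"])

x_new = np.array([0.8, 0.8])
print(knn_classify(x_new, X_train, g_train, k=1))  # brown: one very close point
print(knn_classify(x_new, X_train, g_train, k=3))  # green: majority of three
```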
Classification example in two dimensions (2)
[Figure: the same data classified by k-nearest neighbors]
Linear model vs. k-nearest neighbor

- Linear model
  - # parameters: p
  - Stable, smooth
  - Low variance, high bias
- k-nearest neighbor
  - # effective parameters: N/k
  - Unstable, wiggly
  - High variance, low bias

Each method has its own situations for which it works best.
Misclassification curves
[Figure: test and training misclassification error as model complexity varies]
Enhanced Methods

- Kernel methods using weights
  - Modifying the distance kernels
  - Locally weighted least squares (see the sketch after this list)
- Expansion of the inputs for arbitrarily complex models
  - Projection pursuit and neural networks
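To make "locally weighted least squares" concrete, a sketch for one-dimensional input; the Gaussian kernel and the bandwidth tau are assumed choices:

```python
import numpy as np

def local_linear_fit(x0, x, y, tau=0.3):
    """Locally weighted least squares at query point x0: weight each
    observation with a Gaussian kernel of width tau, then solve the
    weighted normal equations for a local linear fit."""
    w = np.exp(-((x - x0) ** 2) / (2 * tau ** 2))  # kernel weights
    X = np.column_stack([np.ones_like(x), x])      # intercept + slope basis
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta[0] + beta[1] * x0                  # fitted value at x0

# Illustrative 1-D data
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 3, 80))
y = np.sin(2 * x) + 0.2 * rng.normal(size=80)
print(local_linear_fit(1.5, x, y))   # estimate of f at 1.5, near sin(3)
```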
Statistical decision theory (1)

- Given input $X \in \mathbb{R}^p$, output $Y \in \mathbb{R}$
- Joint distribution: $\Pr(X, Y)$
- Looking for a prediction function $f(X)$
- Squared error loss: $L(Y, f(X)) = (Y - f(X))^2$
- Minimizing $\mathrm{EPE}(f) = E(Y - f(X))^2$ gives the conditional expectation $f(x) = E(Y \mid X = x)$
- Nearest-neighbor methods estimate it directly: $\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$
Statistical decision theory (2)

- k-nearest neighbor: $\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$
  - As $N, k \to \infty$ with $k/N \to 0$, $\hat f(x) \to E(Y \mid X = x)$
  - But: insufficient samples! Curse of dimensionality!
- Linear model: assume $f(x) \approx x^T \beta$, giving $\beta = [E(XX^T)]^{-1} E(XY)$
  - But the true function might not be linear!
Statistical decision theory (3)

- If the loss is replaced by the $L_1$ loss $E|Y - f(X)|$, the solution is the conditional median: $\hat f(x) = \mathrm{median}(Y \mid X = x)$
- Robust: less sensitive to outliers than the conditional mean
- But the $L_1$ criterion is discontinuous in its derivatives
Statistical decision theory (4)

- $G$ : categorical output variable
- $L$ : loss function, a $K \times K$ matrix of misclassification costs
- $\mathrm{EPE} = E[L(G, \hat G(X))]$
- Under 0-1 loss, minimizing EPE gives the Bayes classifier: $\hat G(x) = \arg\max_g \Pr(g \mid X = x)$
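A sketch of the Bayes classifier under 0-1 loss, assuming a toy generative model: two univariate Gaussian classes with known priors (all parameter values are made up):

```python
from scipy.stats import norm

# Assumed class-conditional model (illustrative, not from the slides)
priors = {"spam": 0.4, "normal": 0.6}
means = {"spam": 2.0, "normal": 0.0}
sigma = 1.0

def bayes_classify(x):
    """Pick the class g maximizing the posterior Pr(g | X = x),
    proportional to density(x | g) * Pr(g)."""
    posterior = {g: norm.pdf(x, loc=means[g], scale=sigma) * priors[g]
                 for g in priors}
    return max(posterior, key=posterior.get)

print(bayes_classify(0.3))   # "normal": its density * prior dominates
print(bayes_classify(1.8))   # "spam"
```

When the joint distribution is known, no rule beats this one in expected misclassification rate; in practice the posterior must be estimated from data.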
References

- Reading group on "The Elements of Statistical Learning" – overview.ppt, http://sifaka.cs.uiuc.edu/taotao/stat.html
- Welcome to STAT 894 – SupervisedLearningOVERVIEW05.pdf, http://www.stat.ohio-state.edu/~goel/STATLEARN/
- The Matrix Cookbook, http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf
- A First Course in Probability
2.5 Local Methods in High Dimensions

- With a reasonably large set of training data, it seems we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging.
- The curse of dimensionality:
  - To capture 1% of the data to form a local average with p = 10 inputs, we must cover 63% of the range of each input variable: the expected edge length is $e_p(r) = r^{1/p}$, and $e_{10}(0.01) \approx 0.63$.
  - All sample points are close to an edge of the sample.
  - Median distance from the origin to the closest of $N$ data points uniform in the unit ball: $d(p, N) = \left(1 - \tfrac{1}{2}^{1/N}\right)^{1/p}$ (both formulas are checked numerically below).
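A quick numeric check of both formulas; r = 0.01 and p = 10 are the values quoted above, while N = 500 is an assumed sample size:

```python
def edge_length(r, p):
    """Expected edge length of a hypercube capturing a fraction r of
    data uniform in the unit hypercube in p dimensions: e_p(r) = r^(1/p)."""
    return r ** (1.0 / p)

def median_closest(p, N):
    """Median distance from the origin to the closest of N points
    uniform in the unit ball: d(p, N) = (1 - (1/2)^(1/N))^(1/p)."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

print(edge_length(0.01, 10))    # ~0.63: 1% of the data spans 63% of each axis
print(median_closest(10, 500))  # ~0.52: the closest point is more than halfway
                                # to the boundary
```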
2.5 Local Methods in High Dimensions

Example: 1-NN vs. linear model

- 1-NN:
  $\mathrm{MSE}(x_0) = E_T[f(x_0) - \hat y_0]^2 = E_T[\hat y_0 - E_T(\hat y_0)]^2 + [E_T(\hat y_0) - f(x_0)]^2$ = Variance + Sq. Bias
  - As p increases, the MSE and the squared bias tend to 1.0 (a simulation sketch follows below).
- Linear model:
  $\mathrm{EPE}(x_0) = E_{y_0|x_0} E_T(y_0 - \hat y_0)^2 = \mathrm{Var}(y_0|x_0) + E_T[\hat y_0 - E_T \hat y_0]^2 + [E_T \hat y_0 - E y_0]^2 = \mathrm{Var}(y_0|x_0) + \mathrm{Var}_T(\hat y_0) + \mathrm{Bias}^2(\hat y_0)$
  - Here the bias is 0 and, taking the expectation over $x_0$, the expected EPE increases linearly as a function of p.

By relying on rigid assumptions, the linear model has no bias at all and negligible variance, while the error in 1-nearest neighbor is larger.
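A simulation sketch of the 1-NN half of this example. The test function $f(x) = e^{-8\|x\|^2}$ with noiseless data uniform on $[-1, 1]^p$ is an assumption about the setup the slide summarizes, and the trial counts are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_nn_mse_at_origin(p, N=1000, trials=200):
    """Monte Carlo estimate of MSE(x0) = Variance + Sq. Bias for 1-NN
    at x0 = 0, with y = f(X) = exp(-8 ||X||^2) and no noise."""
    preds = np.empty(trials)
    for t in range(trials):
        X = rng.uniform(-1, 1, size=(N, p))
        y = np.exp(-8 * np.sum(X ** 2, axis=1))
        preds[t] = y[np.argmin(np.linalg.norm(X, axis=1))]  # 1-NN prediction
    f0 = 1.0                                # true value f(0) = exp(0) = 1
    return preds.var() + (preds.mean() - f0) ** 2

for p in (1, 2, 5, 10):
    print(p, one_nn_mse_at_origin(p))       # MSE climbs toward 1.0 as p grows
```

In low dimensions the nearest neighbor sits close to the origin, so the bias is tiny; as p grows the nearest neighbor drifts toward the boundary, f there is near 0, and the squared bias approaches 1.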
2.6 Statistical Models, Supervised Learning and Function Approximation

- Goal: find a useful approximation $\hat f(x)$ to the function $f(x)$ that underlies the predictive relationship between the inputs and outputs.
- Supervised learning: the machine learning point of view
- Function approximation: the mathematics and statistics point of view
2.7 Structured Regression Models

- Nearest-neighbor and other local methods face problems in high dimensions.
  - They may be inappropriate even in low dimensions.
  - Need for structured approaches.
- Difficulty of the problem: $\mathrm{RSS}(f) = \sum_{i=1}^{N} (y_i - f(x_i))^2$
  - There are infinitely many solutions minimizing RSS: any function passing through the training points works.
  - A unique solution comes from restrictions on $f$.
2.8 Classes of Restricted Estimators

Methods categorized by the nature of the restrictions:

- Roughness penalty and Bayesian methods
  - Penalize functions that vary too rapidly over small regions of input space.
- Kernel methods and local regression
  - Explicitly specify the nature of the local neighborhood (the kernel function).
  - Need adaptation in high dimensions.
- Basis functions and dictionary methods
  - Linear expansion of basis functions (see the sketch after this list).
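A sketch combining two of these restriction classes: a linear expansion over a hypothetical dictionary of Gaussian bumps, fitted with a ridge-style roughness penalty (the dictionary size, bump width, and penalty multiplier are all illustrative choices):

```python
import numpy as np

# Illustrative 1-D data
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * np.pi * x) + 0.3 * rng.normal(size=100)

# Dictionary of Gaussian bumps h_m(x); centers and width are made up
centers = np.linspace(0, 1, 12)
width = 0.08
H = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

# Linear expansion f(x) = sum_m theta_m h_m(x), fitted by penalized
# least squares with penalty lam * ||theta||^2 (lam is a tuning choice)
lam = 1e-3
theta = np.linalg.solve(H.T @ H + lam * np.eye(len(centers)), H.T @ y)
f_hat = H @ theta          # fitted values at the training x's
print(np.round(theta, 2))
```

Increasing lam makes the fit smoother (higher bias, lower variance); lam is exactly the kind of complexity parameter Section 2.9 discusses.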
2.9 Model Selection and the Bias-Variance Tradeoff

- All models have a smoothing or complexity parameter to be determined:
  - the multiplier of the penalty term
  - the width of the kernel
  - the number of basis functions
- Bias-variance tradeoff:
  - The error from ε is essential; there is no way to reduce it.
  - To reduce the bias or the variance, one typically must increase the other. Tradeoff!
Bias-variance tradeoff in kNN

$\mathrm{EPE}_k(x_0) = \sigma^2 + \left[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\right]^2 + \frac{\sigma^2}{k}$

(irreducible error + squared bias + variance: small k gives low bias but high variance, large k the reverse)
[Figure: prediction error vs. model complexity. Training error decreases steadily with complexity; test error is U-shaped, from high bias/low variance at low complexity to low bias/high variance at high complexity. A simulation sketch follows.]
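A simulation sketch reproducing the shape of this figure with kNN regression; the data model, the noise level, and the grid of k values are made-up choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return np.sin(3 * x)

def knn_predict(x_query, x_train, y_train, k):
    """kNN regression: average the y's of the k nearest training x's."""
    idx = np.argsort(np.abs(x_train[:, None] - x_query[None, :]), axis=0)[:k]
    return y_train[idx].mean(axis=0)

N = 100
x_tr = rng.uniform(0, 3, N); y_tr = f(x_tr) + 0.3 * rng.normal(size=N)
x_te = rng.uniform(0, 3, N); y_te = f(x_te) + 0.3 * rng.normal(size=N)

# Model complexity decreases as k grows (effective parameters ~ N/k)
for k in (1, 5, 15, 50):
    train_err = np.mean((y_tr - knn_predict(x_tr, x_tr, y_tr, k)) ** 2)
    test_err = np.mean((y_te - knn_predict(x_te, x_tr, y_tr, k)) ** 2)
    print(k, round(train_err, 3), round(test_err, 3))
# k = 1: zero training error but inflated test error (high variance);
# large k: both errors rise from bias, giving the U-shaped test curve.
```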