Chapter 1: Introduction
Chapter 2: Overview of Supervised Learning
2006.01.20

Supervised learning
• Training data: a set of observations, each with several features and an outcome.
• Build a learner from the training data.
• Use the learner to predict the unseen outcome of future data from its observed features.

An example of supervised learning: email spam
• [Diagram: a learner is trained on emails known to be spam or normal, then labels new, unknown emails as spam or normal.]

Input & output
• Input = predictor = independent variable
• Output = response = dependent variable

Output types
• Quantitative → regression (e.g., stock price, temperature, age)
• Qualitative → classification (e.g., yes/no, spam/normal)

Input types
• Quantitative
• Qualitative
• Ordered categorical (e.g., small, medium, big)

Terminology
• X: input; Xj: its j-th component
• Y: quantitative output; Ŷ: a prediction of Y
• X (bold): the matrix of observed inputs; xj: the j-th observed value
• G: qualitative output

General model
• Given an input X, the output Y is unknown.
• We want to estimate the function f underlying Y ≈ f(X) from a known data set (the training data).

Two simple methods
• Linear model (linear regression)
• Nearest-neighbor method

Linear model
• Given a vector of input features X = (X1, …, Xp), assume the linear relationship Ŷ = β̂0 + Σj Xj β̂j.
• Least squares criterion: choose β to minimize RSS(β) = Σi (yi − xiᵀβ)².

Classification example in two dimensions (1)
• [Figure: simulated two-class data classified with the linear model.]

Nearest-neighbor method
• Majority vote among the k nearest neighbors.
• [Figure: a new point is classified as brown for k = 1 and as green for k = 3.]

Classification example in two dimensions (2)
• [Figure: the same two-class data classified with k-nearest neighbors.]

Linear model vs. k-nearest neighbors
• Linear model: p parameters; stable, smooth fit; low variance, high bias.
• k-nearest neighbors: effectively N/k parameters; unstable, wiggly fit; high variance, low bias.
• Each method has situations for which it works best (a small simulation comparing the two appears after §2.5 below).

Misclassification curves
• [Figure: test and training misclassification error curves.]

Enhanced methods
• Kernel methods: weights that decrease smoothly with distance from the target point.
• Distance kernels modified to emphasize some variables, for high-dimensional inputs.
• Locally weighted least squares (local regression).
• Expansion of the inputs, so that linear fits can represent arbitrarily complex models.
• Projection pursuit and neural networks.

Statistical decision theory (1)
• Given input X in Rᵖ and output Y in R with joint distribution Pr(X, Y), we look for a prediction function f(X).
• Under squared error loss, the expected prediction error EPE(f) = E(Y − f(X))² is minimized by the conditional expectation f(x) = E(Y | X = x).
• Nearest-neighbor methods estimate it directly: f̂(x) = Ave(yi | xi ∈ Nk(x)).

Statistical decision theory (2)
• k-nearest neighbors: if N, k → ∞ with k/N → 0, then f̂(x) → E(Y | X = x). But in practice samples are often insufficient, and the curse of dimensionality strikes.
• Linear model: assume f(x) ≈ xᵀβ, giving β = [E(X Xᵀ)]⁻¹ E(X Y). But the true function might not be linear!

Statistical decision theory (3)
• With the L1 loss E|Y − f(X)|, the solution is the conditional median, f̂(x) = median(Y | X = x).
• More robust than the conditional mean, but the L1 criterion is discontinuous in its derivatives.

Statistical decision theory (4)
• G: categorical output variable; L: loss function (a matrix of misclassification costs).
• EPE = E[L(G, Ĝ(X))]; minimizing it pointwise under 0–1 loss gives the Bayes classifier: classify to the most probable class given X = x.

References
• Reading group on "Elements of Statistical Learning" – overview.ppt
• Welcome to STAT 894 – SupervisedLearningOVERVIEW05.pdf, http://www.stat.ohio-state.edu/~goel/STATLEARN/
• The Matrix Cookbook, http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf
• http://sifaka.cs.uiuc.edu/taotao/stat.html
• A First Course in Probability

2.5 Local Methods in High Dimensions
• With a reasonably large training set, we could seemingly always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging.
• The curse of dimensionality: the expected edge length of the hypercube needed to capture a fraction r of the data is e_p(r) = r^(1/p). With p = 10 inputs, capturing 1% of the data to form a local average means covering about 63% of the range of each input variable.
• All sample points are close to an edge of the sample.
• For N points uniformly distributed in the unit ball in Rᵖ, the median distance from the origin to the closest data point is d(p, N) = (1 − (1/2)^(1/N))^(1/p).

2.5 Local Methods in High Dimensions: example, 1-NN vs. linear model
• 1-NN: MSE(x0) = E_T[f(x0) − ŷ0]² = E_T[ŷ0 − E_T(ŷ0)]² + [E_T(ŷ0) − f(x0)]² = variance + squared bias. As p increases, the MSE and the squared bias tend to 1.0.
• Linear model: EPE(x0) = E_{y0|x0} E_T(y0 − ŷ0)² = Var(y0|x0) + E_T[ŷ0 − E_T ŷ0]² + [E_T ŷ0 − x0ᵀβ]² = Var(y0|x0) + Var_T(ŷ0) + Bias²(ŷ0). Here the bias is 0, and taking the expectation over x0, the expected EPE increases only linearly as a function of p.
• By relying on rigid assumptions (which are correct in this example), the linear model has no bias at all and negligible variance, while the error of 1-nearest neighbor is substantially larger.
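To make the linear-model vs. k-nearest-neighbor comparison above concrete, here is a minimal sketch in Python/NumPy that fits both to simulated two-class data in two dimensions: a least-squares fit to a 0/1 response thresholded at 0.5, and majority vote among the 15 nearest neighbors. The data-generating mechanism, the choice k = 15, and all variable names are illustrative assumptions, not taken from the slides or from the book's own simulation example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated two-class data in two dimensions (an illustrative mixture).
n_per_class = 100
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_class, 2)),  # class 0
    rng.normal(loc=[2.0, 2.0], scale=1.0, size=(n_per_class, 2)),  # class 1
])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

# Linear model: least-squares regression of the 0/1 response on the inputs,
# classifying to class 1 wherever the fitted value exceeds 0.5.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])       # add an intercept column
beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
linear_pred = (Xb @ beta > 0.5).astype(int)

# k-nearest neighbors: majority vote among the k closest training points.
def knn_predict(X_train, y_train, X_query, k):
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]      # indices of the k nearest
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

knn_pred = knn_predict(X, y, X, k=15)

print("linear model training error:", np.mean(linear_pred != y))
print("15-NN training error:      ", np.mean(knn_pred != y))
```

The linear fit estimates only p + 1 = 3 parameters and produces a smooth boundary; the kNN rule uses effectively N/k ≈ 13 parameters and a wigglier one, which is exactly the stability/flexibility contrast described on the earlier slide.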
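The curse-of-dimensionality claims in §2.5 can also be checked numerically. The sketch below assumes N = 500 points uniformly distributed in the unit ball in Rᵖ, the setting the median-distance formula describes, estimates the median distance from the origin to the closest point by simulation, and compares it with d(p, N) = (1 − (1/2)^(1/N))^(1/p). The repetition count and sampling scheme are implementation choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_unit_ball(n, p):
    """Draw n points uniformly from the unit ball in R^p."""
    g = rng.normal(size=(n, p))                       # random directions
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.uniform(size=(n, 1)) ** (1.0 / p)     # radius density ~ r^(p-1)
    return directions * radii

N, n_reps = 500, 200
for p in (1, 2, 5, 10, 20):
    # Distance from the origin to its nearest data point, over repeated samples.
    nearest = [np.linalg.norm(sample_unit_ball(N, p), axis=1).min()
               for _ in range(n_reps)]
    formula = (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)
    print(f"p={p:2d}  simulated median={np.median(nearest):.3f}  d(p,N)={formula:.3f}")
```

For p = 10 and N = 500 both numbers come out a little above 0.5: the nearest data point is typically more than halfway to the boundary of the ball, so "nearest" neighbors are no longer local.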
2.6 Statistical Models, Supervised Learning and Function Approximation
• Goal: find a useful approximation f̂(x) to the function f(x) that underlies the predictive relationship between the inputs and outputs.
• Supervised learning: the machine-learning point of view.
• Function approximation: the mathematics and statistics point of view.

2.7 Structured Regression Models
• Nearest-neighbor and other local methods face problems in high dimensions, and they may be inappropriate even in low dimensions; hence the need for structured approaches.
• Difficulty of the problem: minimizing the residual sum of squares RSS(f) = Σi (yi − f(xi))², summed over the N training cases, has infinitely many solutions. A unique solution comes only from restrictions on f.

2.8 Classes of Restricted Estimators
Methods are categorized by the nature of the restrictions.
• Roughness penalty and Bayesian methods: penalize functions that vary too rapidly over small regions of input space.
• Kernel methods and local regression: explicitly specify the nature of the local neighborhood (the kernel function); need adaptation in high dimensions.
• Basis functions and dictionary methods: linear expansion of basis functions.

2.9 Model Selection and the Bias-Variance Tradeoff
• All models have a smoothing or complexity parameter to be determined: the multiplier of the penalty term, the width of the kernel, or the number of basis functions.
• Bias-variance tradeoff: the noise ε is essential and cannot be reduced; reducing one of bias and variance tends to increase the other.

Bias-variance tradeoff in kNN
• [Figure: prediction error vs. model complexity, showing the training error and test error curves; low complexity corresponds to high bias and low variance, high complexity to low bias and high variance.]
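The trade-off pictured above can be reproduced with a small simulation. The sketch below uses illustrative choices throughout (the target function, noise level, and sample sizes are assumptions, not from the slides): it fits k-nearest-neighbor regression for a range of k and prints training and test mean squared error. The effective number of parameters is N/k, so small k means high model complexity.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # Illustrative true regression function (an assumption for this sketch).
    return np.sin(2 * np.pi * x)

def knn_regress(x_train, y_train, x_query, k):
    # Average the responses of the k training points nearest to each query point.
    dists = np.abs(x_query[:, None] - x_train[None, :])
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

N, sigma = 100, 0.3
x_train = rng.uniform(size=N)
y_train = f(x_train) + sigma * rng.normal(size=N)
x_test = rng.uniform(size=1000)
y_test = f(x_test) + sigma * rng.normal(size=1000)

for k in (1, 3, 5, 10, 25, 50, 100):
    train_mse = np.mean((knn_regress(x_train, y_train, x_train, k) - y_train) ** 2)
    test_mse = np.mean((knn_regress(x_train, y_train, x_test, k) - y_test) ** 2)
    print(f"k={k:3d}  N/k={N // k:3d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error is smallest at k = 1 (each training point is its own nearest neighbor) and grows with k, while test error is typically minimized at an intermediate k, reproducing the shape of the test and training error curves on the slide.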
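The other complexity parameters listed in §2.9, the multiplier of the penalty term and the number of basis functions, behave the same way. As a minimal sketch, and not the book's procedure (the data, the monomial basis, and the λ grid are assumptions), the code below fits a polynomial basis expansion by penalized least squares, minimizing ||y − Hβ||² + λ||β||², and prints training and test error as λ varies.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data (illustrative assumption): a smooth function plus noise,
# with test points drawn from within the range of the training inputs.
N = 30
x = rng.uniform(-1, 1, size=N)
y = np.sin(3 * x) + 0.3 * rng.normal(size=N)
x_test = rng.uniform(x.min(), x.max(), size=1000)
y_test = np.sin(3 * x_test) + 0.3 * rng.normal(size=1000)

def poly_basis(x, degree=12):
    # Linear expansion of basis functions h_m(x) = x^m (a monomial dictionary).
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(H, y, lam):
    # Penalized RSS: minimize ||y - H beta||^2 + lam * ||beta||^2.
    # (For simplicity the intercept is penalized along with the other terms.)
    p = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(p), H.T @ y)

H_train, H_test = poly_basis(x), poly_basis(x_test)
for lam in (1e-6, 1e-3, 1e-1, 10.0, 1000.0):
    beta = ridge_fit(H_train, y, lam)
    train_mse = np.mean((H_train @ beta - y) ** 2)
    test_mse = np.mean((H_test @ beta - y_test) ** 2)
    print(f"lambda={lam:g}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Large λ gives a heavily shrunken, smooth fit (high bias, low variance); small λ tracks the training data closely (low bias, high variance). This is the same trade-off as varying k above, with λ playing the role of the complexity parameter.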