Generalization in Supervised Machine Learning

BLiNQ MEDIA
Praneeth Vepakomma
Senior Data Scientist
Hypothetical Knapsack of Coins:
Copper and Gold Coins
The total number of coins is fixed and large.
Capture-Recapture
What is the proportion of Gold coins?
Copper and Gold Coins
The total number of coins is variable and large.
Capture-Recapture
What is the proportion of Gold coins?
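A minimal simulation sketch of the knapsack estimation problem (all numbers below are hypothetical, chosen only to illustrate sampling a large population and estimating the gold proportion from a captured sample):

    import random

    # Hypothetical knapsack: a large population of coins.
    N = 100_000
    true_gold = 0.3  # unknown in practice; used here only to simulate
    knapsack = ["gold" if random.random() < true_gold else "copper"
                for _ in range(N)]

    # Capture a sample and estimate the gold proportion from it.
    sample = random.sample(knapsack, 500)
    estimate = sum(c == "gold" for c in sample) / len(sample)
    print(f"Estimated proportion of gold coins: {estimate:.3f}")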
BASIC ML/STAT TERMINOLOGY:
$X_{n \times p}$ : Features, Predictors
$Y_{n \times k}$ : Labels, Responses
190 years after Gauss, the core problem of prediction remains an active problem:

$X_{n \times p}$ : Features, Predictors
$Y_{n \times k}$ : Labels, Responses

Then: $\min_{\beta} \, \| Y - X\beta \|^2$

Now: $\mathbb{E}(Y_i) = g^{-1}\Big(\sum_{j=1}^{p} f_j(X_{ij})\Big)$

Find a mapping♯ from the features, where $\theta$ is a list of parameters required to represent the function $f_\theta$.

♯Approximation
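As a concrete sketch of the "Then" objective, here is ordinary least squares on synthetic data (the data and coefficients are hypothetical; numpy's lstsq is used for the minimizer):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 3
    X = rng.normal(size=(n, p))             # X_{n x p}: features
    beta_true = np.array([1.5, -2.0, 0.5])
    Y = X @ beta_true + rng.normal(scale=0.1, size=n)  # responses

    # Gauss's problem: choose beta minimizing ||Y - X beta||^2
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(beta_hat)  # should be close to beta_true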
What is Supervised Learning?

[Diagram] Learn a mapping $f : X \to Y$ from existing features and known labels by optimizing a loss function, under modeling assumptions; the learned mapping is then applied to unavailable (future) features whose labels are unknown.
Evaluating the Learned Function:

• The loss function quantifies the error in the approximation.
• Learn a mapping by optimizing the loss.

$L(f_\theta(X), Y)$

Example: $(f_\theta(X) - Y)^2$
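A minimal sketch of evaluating a learned function under the squared loss ($f_\theta$ here is a hypothetical linear model, used only to make the loss concrete):

    import numpy as np

    def f_theta(X, theta):
        # Hypothetical parametric mapping: a linear model.
        return X @ theta

    def squared_loss(pred, y):
        # L(f_theta(X), Y) = (f_theta(X) - Y)^2, averaged over the sample
        return np.mean((pred - y) ** 2)

    X = np.array([[1.0], [2.0], [3.0]])
    y = np.array([2.1, 3.9, 6.2])
    print(squared_loss(f_theta(X, np.array([2.0])), y))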
Predictions with varying parameters: [plots omitted]
How do we generalize?
Generalization and Predictability
$X, Y \sim \mathcal{D}$

Empirical Risk Minimization:

$\frac{1}{n} \sum_{i=1}^{n} L\big(f_\theta(X_{i\cdot}), Y_{i\cdot}\big)$

• Empirical risk is the average (expected) loss on the seen data.
• True risk is the expected loss under the process generating the (X, Y) pairs.

True Risk Minimization:

$\mathbb{E}_{(X,Y) \sim \mathcal{D}}\big[L(f_\theta(X), Y)\big]$
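A sketch of empirical risk minimization over a coarse grid of θ values for the squared loss (data-generating process and grid are hypothetical); the true risk would replace the sample average with an expectation over $\mathcal{D}$, which we can only approximate:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 1, size=200)
    Y = 2.0 * X + rng.normal(scale=0.2, size=200)  # (X, Y) ~ D

    def empirical_risk(theta):
        # (1/n) * sum_i L(f_theta(X_i), Y_i) with squared loss
        return np.mean((theta * X - Y) ** 2)

    grid = np.linspace(0.0, 4.0, 81)
    theta_hat = grid[np.argmin([empirical_risk(t) for t in grid])]
    print(theta_hat)  # near 2.0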
PARAMETRIC CHARACTERIZATION OF THE MAPPING:

• 2-D linear function: slope, intercept
• Cubic spline: number of knots, location of knots
• Nearest-neighbor regression: number of neighbors
• Lasso: L1/L2 penalty weights
• Support vector machines: kernel width, margin length
• Random forests: resampling sample size

• There is a long list of available supervised learning techniques, and most of them have tuning parameters.
• We can minimize out-of-sample error by tuning the technique with optimal parameters.
• Tuning can be performed by cross-validation over a discrete grid of parameter combinations (see the sketch after this list).
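A sketch of grid tuning with cross-validation, using scikit-learn's GridSearchCV on a nearest-neighbor regressor (the data and grid values are hypothetical):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(2)
    X = rng.uniform(0, 1, size=(300, 2))
    y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=300)

    # Cross-validate over a discrete grid of the tuning parameter
    # (number of neighbors) and keep the best out-of-sample performer.
    search = GridSearchCV(KNeighborsRegressor(),
                          {"n_neighbors": [1, 3, 5, 10, 20, 50]},
                          cv=5, scoring="neg_mean_squared_error")
    search.fit(X, y)
    print(search.best_params_)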
CURSE OF DIMENSIONALITY: Flat World vs. 10-D World
[plots omitted]

CURSE OF DIMENSIONALITY: Let us validate:
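One way to validate the curse numerically (a sketch, with the 10% volume fraction chosen as a hypothetical target): in a unit hypercube, the edge length of a sub-cube capturing a fixed fraction of the volume grows toward 1 as the dimension increases, so "local" neighborhoods stop being local.

    # Edge length needed to capture a fraction r of a unit hypercube's
    # volume in d dimensions: edge = r ** (1/d)
    r = 0.10
    for d in (1, 2, 10):
        print(d, round(r ** (1 / d), 3))
    # d=1 -> 0.1, d=2 -> 0.316, d=10 -> 0.794:
    # in 10-D, covering 10% of the data spans ~80% of each axis.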
Structural Risk Minimization via Regularization:

$\frac{1}{n} \sum_{i=1}^{n} L\big(f_{\mathrm{Linear}}(X_i), Y_i\big) + \lambda \sum_{j=1}^{p} |\beta_j|$
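A sketch of this regularized objective using scikit-learn's Lasso, where alpha plays the role of λ (data synthetic, coefficients hypothetical):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 10))
    beta = np.zeros(10)
    beta[:3] = [2.0, -1.0, 0.5]  # sparse ground truth
    y = X @ beta + rng.normal(scale=0.1, size=200)

    # Minimize squared loss + lambda * ||beta||_1; larger alpha shrinks
    # more coefficients exactly to zero.
    model = Lasso(alpha=0.05).fit(X, y)
    print(model.coef_)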
Genealogy?
Brief Description
Technology Overview
Hiring (What we’re looking for)
http://blinqmedia.com/contact/job-openings/
Let's work with the Abalone dataset
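A sketch for loading the data (the UCI URL and column names below are assumptions about the intended Abalone dataset):

    import pandas as pd

    # Assumed UCI location and schema for the Abalone dataset.
    url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
           "abalone/abalone.data")
    cols = ["Sex", "Length", "Diameter", "Height", "WholeWeight",
            "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings"]
    abalone = pd.read_csv(url, header=None, names=cols)
    print(abalone.head())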
Thank You!