
An entirely too expansive talk on statistics
By Nick Thieme
Me
•3rd-year PhD student in Hector Corrada Bravo’s Biomedical Data Science Lab
•I work on algorithm development/theory for bioinformatics
•I’m here because I took Storytelling with Data Visualization (taught by Nick Diakopoulos) with
Kushal and Snigda and they were nice enough to invite me to come speak
•I’m also here because I love statistics communication
Reminder of notation
•Vector: $x = (x_1, x_2, \ldots, x_n)$
•Matrix: $X_{n,p} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} = (x_1, \ldots, x_p)$
•Vector norm
• Captures “length” of vector
• “p-norms” are very common:
$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$
• When p=2, you get the usual notion of distance, “as the crow flies”; p=1 gives “taxicab” distance
• e.g., for y = (3, 4): p=2 gives ||y|| = 5, p=1 gives ||y|| = 7
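•A minimal check in R (the vector (3, 4) is an assumption here, chosen to match the two norms on the slide):

```r
# Compute p-norms directly from the formula above
y <- c(3, 4)

p_norm <- function(x, p) sum(abs(x)^p)^(1/p)

p_norm(y, 2)  # 5, Euclidean, "as the crow flies"
p_norm(y, 1)  # 7, taxicab distance
```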
Probability basics
•A lot of ways to approach a random variable
• For us, it’s quantity, X, that takes on values with fixed probabilities
•Random variables are characterized by their probability distributions
• A function f  x   P  X  x  with two conditions:
• Probability can’t be negative
P X  x  0
• Probability of all events sum to 1
n
 P X  x  1
i 1
i
• These functions have parameters which describe their behavior
• In the coin flip, the parameter is the probability of getting heads
•Discrete random variables
• Can take on a “countable” number of values: rolling a die, flipping a coin
•Continuous random variables
• Can take on an “uncountable” number of values: exact height
Probability basics (2)
•Random variables are often related:
• X = whether I’ll be up late tonight, {yes, no}, $P(X = \text{yes}) = .7$
• Y = whether I’ll wake up on time tomorrow, {yes, no}, $P(Y = \text{yes}) = .2$
•We can talk about their joint distribution (what’s the probability of these things happening together)
$P(X = \text{yes}, Y = \text{yes}) = .1$
$P(X = \text{yes}, Y = \text{no}) = .6$
$P(X = \text{no}, Y = \text{yes}) = .1$
$P(X = \text{no}, Y = \text{no}) = .2$
•Variables are “independent” if $P(X = x)\,P(Y = y) = P(X = x, Y = y)$
• Both in my life and above, X and Y are not independent
Probability basics (3)
•We can also talk about “conditional distributions” (what’s the probability of one thing happening, given that we know another thing has already happened)
$P(Y = \text{yes} \mid X = \text{yes}) = \dfrac{P(Y = \text{yes}, X = \text{yes})}{P(X = \text{yes})}$
• What’s the probability I will wake up on time tomorrow, given that I know I stayed up late tonight?
• $.1/.7 \approx .143$
• Conditional distributions are very important and will come up when we talk about regression
• (For reference: $P(X = \text{yes}, Y = \text{yes}) = .1$, $P(X = \text{yes}, Y = \text{no}) = .6$, $P(X = \text{no}, Y = \text{yes}) = .1$, $P(X = \text{no}, Y = \text{no}) = .2$, $P(X = \text{yes}) = .7$)
Bayes Theorem
•A formula that allows us to “flip” conditional probabilities
P Y | X  P  X 
P X |Y  
P Y 
•May not seem like much
• Let D be a measured dataset, and H be a hypothesis we have about the data
P H | D 
PD | H  PH 
P D
• Gives a way to turn the probability of the data given the hypothesis (which is not exactly what scientists want
to talk about) into the probability of the hypothesis given the data (which seems closer to what we want to
talk about)
• That’s the nice idea behind Bayesian analysis. The problem is we have to specify what P(H) and P(D) are
• Lots of heated arguments because of this theorem. What must the amazing derivation be?
The magnificent derivation of Bayes theorem
$P(X \mid Y) = \dfrac{P(X, Y)}{P(Y)} = \dfrac{P(Y \mid X)\, P(X)}{P(Y)}$
Some discrete probability distributions
•Bernoulli distribution
• If you’re considering a single event with two possible outcomes and you know the probability of each outcome, it’s a
Bernoulli variable
• X = Will I wake up on time tomorrow?
$f(x, p) = \begin{cases} p = .3 & x = \text{yes} \\ 1 - p = .7 & x = \text{no} \end{cases}$
•Binomial distribution
• If you’re considering a series of independent Bernoulli trials where the probability of a “success” is the same in each trial
• Two parameters: the number of trials and the probability of success
• X = Out of the next five days, how many times will I wake up on time? {0,1,2,3,4,5}
$f(x; n, p) = \binom{n}{x} p^x (1-p)^{n-x}$
• Probability I’m up on time three of the next five days (see the check below):
$\binom{5}{3} (.3)^3 (.7)^2 = 10 \times .3^3 \times .7^2 = .1323$
Another discrete probability distribution
• Poisson distribution
• If you’re considering how many times something
will happen in a fixed amount of time, you know
how often that thing occurs (on average), and the
occurrences are independent
• X = How many great songs will the wedding DJ
play in one hour if we know he plays five great
songs in an hour
f  x,   
 xe
x!
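•In R, Poisson probabilities for the DJ example (λ = 5) are one call:

```r
# Poisson probabilities for the wedding-DJ example (lambda = 5 great songs per hour)
dpois(5, lambda = 5)          # P(exactly 5 great songs), about 0.175
dpois(0, lambda = 5)          # P(no great songs), about 0.0067
sum(dpois(0:10, lambda = 5))  # most of the probability mass sits at 10 or fewer
```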
Continuous probability distributions
•Normal distribution
• Useful in a lot of physical and social processes, partly because of central limit theorem (CLT)
• “Everyone believes [errors are normally distributed]: experimentalists believing that it is a mathematical
theorem, mathematicians believing that it is an empirical fact.” — Gabriel Lippman
• X = height of an adult female, error in measurement
f  x,  ,   
 x   2
1
e
1/2
 2 
2
2 2
•Uniform distribution (can also be discrete)
• Extremely useful for sampling, transformations, bootstraps (will come back to),…
• Assigns an equal probability to every value in an interval
f  x, a , b  
1
I
b  a a xb
Expectations and variances
•We’d like to have some idea what to expect from random variables. Expectation and variance
give us that
•The expectation (or mean) gives us the central tendency
• Discrete: $E[X] = \sum_{i=1}^{n} x_i\, P(X = x_i)$
• Continuous: $E[X] = \int x f(x)\, dx$
•The variance quantifies the spread around the mean:
$Var(X) = E\left[ (X - E[X])^2 \right]$
Expectations and variances for distributions
•These are in terms of the distributions’ parameters
• Binomial: $E = np$, $Var = np(1-p)$
• Poisson: $E = \lambda$, $Var = \lambda$
• Normal: $E = \mu$, $Var = \sigma^2$
• Uniform: $E = \frac{1}{2}(a + b)$, $Var = \frac{1}{12}(b - a)^2$
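•These identities are easy to sanity-check by simulation:

```r
# Simulated moments vs. the table above
set.seed(1)
x <- rbinom(1e5, size = 10, prob = 0.3)
mean(x); var(x)  # about np = 3 and np(1-p) = 2.1

y <- rpois(1e5, lambda = 4)
mean(y); var(y)  # both about lambda = 4

u <- runif(1e5, min = 2, max = 8)
mean(u); var(u)  # about (a+b)/2 = 5 and (b-a)^2/12 = 3
```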
Nature doesn’t hand you parameters
•When you collect data, you don’t collect μ and σ. We have to estimate them.
•Everyone knows the sample mean of a normal is $\frac{1}{n}\sum_{i=1}^{n} x_i$, but why?
•Assume we collect n independent samples from a normal distribution
•A very common way is called maximum likelihood estimation. Just like the name says, choose the value of the parameter that maximizes the likelihood
•So to estimate the mean, calculate the joint probability of the samples:
$P(x) = \prod_{i=1}^{n} P(x_i) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2}}$
Mean derivation continued
•It’s usually easier to maximize the log likelihood:
$\log P(x) = \log\left[ \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2}} \right] = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (x_i - \mu)^2$
•Maximize by setting the derivative with respect to μ to zero:
$\frac{\partial}{\partial \mu}\left[ -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (x_i - \mu)^2 \right] = \frac{1}{\sigma^2}\sum_{i=1}^{n} (x_i - \mu) = 0$
$\Rightarrow \sum_{i=1}^{n} x_i = n\mu \Rightarrow \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$
How good are our estimates?
•Three main considerations
• Bias, Variance, Consistency
• We’ll talk about the first two
•An estimator is unbiased if it has the right expectation
1 n  1 n
E   xi    E  xi   
 n i 1  n i 1
•There are a lot of things to say about variance and estimators, but it’s probably enough to say
estimators with smaller variance are better.
• Both the usual sample mean and $x_1$ are unbiased estimators of the mean. Compare variances:
$Var\left( \frac{1}{n}\sum_{i=1}^{n} x_i \right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(x_i) = \frac{\sigma^2}{n}, \qquad Var(x_1) = \sigma^2$
Mean square error
•One way to measure error combines bias and variance ideas nicely.
•For an estimator $\hat{\theta}$, the MSE is:
$E\left[ (\hat{\theta} - \theta)^2 \right]$
•Which can also be written as:
$E\left[ (\hat{\theta} - E[\hat{\theta}])^2 \right] + \left( E[\hat{\theta}] - \theta \right)^2 = Var(\hat{\theta}) + Bias(\hat{\theta})^2$
•This gives rigor to why we would prefer an unbiased estimator with lower variance to one with
higher variance. The MSE will be less.
•But do we always want unbiased estimators?
A toy example
•Say we have a normal sample and we want to
estimate the mean of the distribution it came
from
• Consider two estimators of the mean:
$\hat{\mu}_1 = \frac{1}{n+1}\left( \sum_{i=1}^{n} x_i \right), \qquad \hat{\mu}_2 = x_1$
•The first is biased, the second one isn’t. Simulate:
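•A minimal simulation of the two estimators (assuming $\hat{\mu}_1$ divides by n+1, as above):

```r
# MSE of the slightly biased shrunken mean vs. the unbiased single observation
set.seed(1)
n <- 20; mu <- 5; reps <- 10000

est <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = 1)
  c(shrunk = sum(x) / (n + 1), first = x[1])
})

rowMeans((est - mu)^2)  # the biased estimator has far lower MSE than x_1
```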
I might have started with this slide
What is a statistic?
•Any function of a random sample of data which can be
observed is a statistic
•Suppose we have an infinite bingo machine that spits out
bingo numbers with fixed probabilities that we don’t know
•We collect 100 bingo balls from the machine
•Which of the following are statistics?
• The average number on our bingo balls
• The proportion of our sample corresponding to each number
• The proportion of the balls in the machine corresponding to each
number
• The largest number we draw
Examples of common statistics
• Z-score:
$\dfrac{x - \bar{x}}{sd(x)}$
• “Wald” confidence interval for a parameter $\theta$ that we estimate with $\hat{\theta}$:
$\hat{\theta} - z_{\alpha/2}\, se(\hat{\theta}) \leq \theta \leq \hat{\theta} + z_{\alpha/2}\, se(\hat{\theta})$
• Example for the mean:
$\bar{x} - \frac{\sigma}{\sqrt{n}} z_{\alpha/2} \leq \mu \leq \bar{x} + \frac{\sigma}{\sqrt{n}} z_{\alpha/2}$
• Not always the best, but conceptually simple
• t statistic:
$\dfrac{\hat{\theta} - \theta_0}{se(\hat{\theta})}$
Hypothesis Testing
•A lot of ways to decide on tests (Wald, likelihood ratio, score,…) but one general method looks
like:
•State a hypothesis and decide on a significance level
• e.g., population mean = 2, significance level = .05
•Calculate the test statistic and the distribution of the test statistic:
$\frac{\sqrt{n}\,(\bar{x} - 2)}{s} \sim t_{n-1}$
•Calculate the confidence interval:
$\bar{x} - \frac{s}{\sqrt{n}}\, t_{n-1,.025} \leq \mu \leq \bar{x} + \frac{s}{\sqrt{n}}\, t_{n-1,.025}$
•If the null hypothesis is in the confidence interval, fail to reject
•If not, reject the null with 1-α confidence
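•The whole procedure is one line in R; a hand-rolled version (on made-up data, for illustration) matches t.test:

```r
# Test H0: mu = 2 at the .05 level, by hand and with t.test
set.seed(1)
x <- rnorm(30, mean = 2.5, sd = 1)  # hypothetical sample

t_stat <- sqrt(30) * (mean(x) - 2) / sd(x)
ci <- mean(x) + c(-1, 1) * qt(.975, df = 29) * sd(x) / sqrt(30)
t_stat; ci

t.test(x, mu = 2)  # same statistic, same confidence interval
```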
Where in the world are the p-values?
•P-values are tricky and semantic (so is all of stats, but p-values more so)
•“The probability of having observed a test statistic as large or larger under the null
hypothesis”
• This and this alone
•You could have skipped this whole talk and read the ASA’s statement on p-values as well as some
of their references
• My favorites (and I really do recommend reading these responses as well as some of their other stuff)
are Deborah Mayo and Andrew Gelman’s responses. They have very different general philosophies
•Confidence intervals can perform hypothesis testing
•Effect size is more important than significance
•“The difference between statistically significant and not statistically significant is not statistically
significant”
An illustration of a p-value problem: multiple comparisons
Lamar Smith’s most referenced chart
The beginning of regression
•At its most basic we have two vectors of observations:
• $y$, which contains the response variable
• $x$, which contains the predictor variable
•The single linear regression model is:
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
• It assumes independent and identically distributed normal errors
• The betas are the parameters we want to estimate
• $\beta_0$ is the intercept of the model, $\beta_1$ gives the slope
• Estimated by minimizing the squared error:
$\sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$
• Suppose y is a car’s stopping distance and x is the car’s speed:
• $\beta_0$ is the stopping distance of a car moving 0 mph
• $\beta_1$ tells us how much further it takes a car to stop if it’s moving 1 mph faster
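•This is exactly the setup of R’s built-in cars data (1920s stopping distances), so a sketch:

```r
# Simple linear regression of stopping distance (ft) on speed (mph)
fit <- lm(dist ~ speed, data = cars)
coef(fit)  # (Intercept) is beta_0 hat; speed is beta_1 hat

plot(dist ~ speed, data = cars)
abline(fit)  # the fitted line
```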
Multiple regression
•Same thing, but instead of a single predictor we now have multiple predictors, which we collect into a matrix X
•Model:
$y = X\beta + \epsilon$
•Same assumptions as before, same interpretation as before.
•I’ll show the minimization for these. Write out the squared error:
$\|y - X\beta\|_2^2 = (y - X\beta)^T (y - X\beta) = y^T y - 2\beta^T X^T y + \beta^T X^T X \beta$
•Minimize it:
$\frac{\partial}{\partial \beta}\left[ y^T y - 2\beta^T X^T y + \beta^T X^T X \beta \right] = -2 X^T y + 2 X^T X \beta = 0$
$\Rightarrow X^T X \beta = X^T y \Rightarrow \hat{\beta} = (X^T X)^{-1} X^T y$
Regression diagnostics
•How to check assumptions?
•Data from 31 cut cherry trees
• Radius of tree
• Height of tree
• Amount of wood produced
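•Assuming these are R’s built-in trees data (31 felled black cherry trees), the standard diagnostic plots are one call (a sketch, not the talk’s exact code):

```r
# Fit wood volume on girth and height, then look at the usual diagnostics
fit <- lm(Volume ~ Girth + Height, data = trees)

par(mfrow = c(2, 2))
plot(fit)  # residuals vs. fitted, normal QQ, scale-location, leverage
```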
Why R^2 isn’t a great measure
• Strictly increasing as a function of the number of variables included; the larger the better
• Randomly generated a variable called non and included it (call this one mod2, the other mod1)
• R^2 increased by .002, adj R^2 decreased by .02
• Two easy alternatives are AIC and BIC
• Both “information criterion”
• Penalize for number of variables
• The smaller the better
• AIC: mod1 = 155, mod2 = 161
• BIC: mod1 = 162, mod2 = 174
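•A hedged re-run of that experiment (the variable name non comes from the slide; exact numbers depend on the seed and data):

```r
# Add a pure-noise predictor and watch R^2 rise while adj. R^2, AIC, and BIC get worse
set.seed(1)
trees$non <- rnorm(nrow(trees))  # randomly generated, unrelated to Volume

mod1 <- lm(Volume ~ Girth + Height, data = trees)
mod2 <- lm(Volume ~ Girth + Height + non, data = trees)

summary(mod1)$r.squared;     summary(mod2)$r.squared      # R^2 can only go up
summary(mod1)$adj.r.squared; summary(mod2)$adj.r.squared  # adjusted R^2 drops
AIC(mod1); AIC(mod2)  # smaller is better
BIC(mod1); BIC(mod2)
```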
How (not) to choose variables?
•Could choose based on which variables are significant
• Don’t do this. We saw the test statistic was
$\dfrac{\hat{\beta}_i - \beta_0}{se(\hat{\beta}_i)}, \qquad se(\hat{\beta}_i)^2 = \dfrac{\hat{\sigma}^2\, VIF(\hat{\beta}_i)}{n\, Var(X_i)}$
• The less correlated with other variables, the more variance in the predictor, the smaller the p-value
• Not what we want to measure
•Closest would be stepwise regression
• Still not great
• Depends on the order variables are included (if predictors are correlated, basically a random choice; good for boosted trees, not good here)
• Biased coefficient estimates
• Multiple comparison issues
One of the perils of selecting by p-values
• True model: $y = \beta_0 + \beta_1 x_1$
• Poly model: $y = \beta_0 + \sum_{i=1}^{20} \beta_i x_1^i$
• Fit both:
• True: AIC = 60, BIC = 60, MSE = .513
• Poly: AIC = -16, BIC = -35, MSE = .004
Overfitting
Fighting overfitting with train/test sets
•Easiest way to fix this is with a train/test set
• Randomly sample some number of observations (rule of thumb is 1/5th)
• Hide this data (test set)
• Take the rest of it (training set) and train your models
• Check the performance on the test set and choose the model that does better
•Doing this for the previous example gives an MSE of .47 for the true model vs MSE of 1.05 for
the polynomial model
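•A minimal split in R (using the cherry-tree data as a stand-in):

```r
# Hold out roughly 1/5th of the rows, train on the rest, score on the holdout
set.seed(1)
n <- nrow(trees)
test_idx <- sample(n, size = round(n / 5))

train <- trees[-test_idx, ]
test  <- trees[test_idx, ]

fit <- lm(Volume ~ Girth + Height, data = train)
mean((predict(fit, newdata = test) - test$Volume)^2)  # test-set MSE
```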
Another option: shrinkage
•Ridge regression:
• Shrinks beta coefficients to 0 but will never exactly hit 0
1
T
ˆ
min || y  X  ||2  ||  ||2 ,  =  X X   I  X T y
•LASSO:
$\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
• Doesn’t look very different but remember norms we talked about:
$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}, \qquad \|x\|_1 = \sum_{i=1}^{n} |x_i|$
• Measures the size of the estimates differently
• This forces coefficients to 0
• Great for variable selection
Shrinkage (2)
• Compare the objective functions for ridge and lasso with regular ol’ regression
$\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2, \qquad \min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_1, \qquad \min_\beta \|y - X\beta\|_2^2$
• The only difference is the bit at the end
• Called the penalty
• General form looks like $L(y, X\beta) + \lambda J(\beta)$
• Lots and lots of models can be written this way (splines, total variation, SVM,…)
• When λ = 0 we get regular regression; as λ goes to infinity the coefficients go to 0
• Need to specify λ though!
Shrinkage example
•Auto dataset
• Car data with mpg, country of origin, cylinders, horsepower,
displacement, weight, acceleration, year
• Predict mpg by other variables
•Produced using glmnet package (Hastie, Tibshirani)
•The coefficients change a ton depending on lambda
•Can use something like the train/test set (cross validation) to
choose lambda
• Get a candidate list of lambdas
• Split the data up into k pieces (folds)
• Set one fold aside as the test set and train on the other k-1
• Test on the one not used
• Do this for all of the folds and all the values of lambda
• Choose the lambda with the lowest average test error (see the sketch after this slide)
•Top is ridge, bottom is lasso
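•A sketch of the glmnet workflow (assuming the Auto data ships with the ISLR package; not the talk’s exact code, and the chosen lambdas will vary):

```r
# Ridge and lasso on the Auto data, with lambda chosen by cross validation
library(glmnet)
library(ISLR)

X <- as.matrix(Auto[, c("cylinders", "displacement", "horsepower",
                        "weight", "acceleration", "year", "origin")])
y <- Auto$mpg

ridge <- cv.glmnet(X, y, alpha = 0)  # alpha = 0 gives the ridge penalty
lasso <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 gives the lasso penalty

ridge$lambda.min; ridge$lambda.1se   # the two vertical-line lambdas on the plots
coef(lasso, s = "lambda.min")        # lasso zeroes some coefficients exactly

plot(ridge); plot(lasso)             # CV error curves
```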
Shrinkage continued
• cross validation done with cv.glmnet
• The vertical lines are the chosen values of lambda
• If there are two lines, the smaller chosen lambda is the one that minimized MSE, the larger one is
one standard error greater (rule of thumb)
• In this case, we would choose either .07 or .71 for lasso and .73 or 1.8 for ridge. If we had leftover
data (that we hadn’t used at all!) we could cross-validate or test/train to choose between the
heuristics
• I actually created a test set before the CV process
• Ridge test MSE = 9.78 (on average each prediction off by 10.7%)
• Lasso test MSE = 10.6 (11.3%)
• Linear model = 10.7 (11.7%)
• What you give up for that little boost:
• Linear coefs: cyl = -.38, disp=.02, hp = -.018, wt = -.006, acl =.133, yr = .72, ori = 1.62
• Ridge coefs: cyl=-.29, disp=-.0007,hp=-.02, wt=-.003, acl=-.01,yr=.62,ori=1.41
• Lasso coefs: hp=-.007, wt=-.005, yr = .56, ori=.74
• Coefs biased towards 0 in exchange for lower variance. Remember MSE from earlier
• One great thing about ridge and lasso is they work when n = number of observations is less than p
=number of predictors
• Linear regression can’t be fit here
Dimensionality Reduction
•Often want to reduce the number of predictors, p
• For computation
• For interpretability
• To visualize
•Lots of ways, many work with metric multi-dimensional scaling (MDS)
•The general algorithm is:
• Calculate a distance matrix
• Solve the optimization problem
$\arg\min_{z_1, \ldots, z_n} \left[ \sum_{i,j=1}^{n} \left( d_{i,j} - \|z_i - z_j\| \right)^2 \right]^{1/2}$
• Here d_i,j are the distances you calculated between x_i and x_j
• And the norm is in some specified metric
• We’ll cover two
• PCA (and a variant, sparse PCA)
• Isomap
PCA
• A very common method for dimensionality reduction
• A vector (or an observation) can be considered a
“direction” in some high-dimensional space
• PCA finds the “directions” that contain the most
variance in your data
• If the distance matrix is calculated with the regular L2
(Euclidean) distance, the formula on the last page gives
you PCA
• These directions are linear combinations of the original
predictors (weights called loadings)
• You keep the k directions with the most variance, and
project your data onto this smaller number of
directions
• The directions are your new predictors, the projected
points are your new observations
• Usually, 2 or 3 directions contain 95%+ of the variance
• I used princomp in R on the cars data
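•A sketch of that princomp call (assuming the Auto data again; loadings will differ slightly from the slide’s):

```r
# PCA on the predictors (mpg dropped), then look at variance explained and loadings
library(ISLR)

X <- Auto[, c("cylinders", "displacement", "horsepower",
              "weight", "acceleration", "year", "origin")]

pc <- princomp(X)
summary(pc)                           # proportion of variance per component
pc$loadings[, 1:2]                    # loadings for the first two directions
plot(pc$scores[, 1], pc$scores[, 2])  # data projected onto PC1 and PC2
```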
PCA on the cars data
•PCA on cars data (minus mpg), kept top two
components, color by mpg
•PC1 loadings
• Cyl = -.001, disp=-.11, hp=-.03, wt = -.99, acc =
.001, yr = .001, ori=.0005
•PC2
• Cyl =-.01,disp=-.94,hp=-.29, wt =.12, acc=.03,
yr=.02, ori=.003
•Could perform prediction with the new PCs
•Issues:
• Hard to interpret
• Can only work with pretty standard data
• Linear embedding
Isomap and the swiss roll
•A non-linear embedding technique
• Instead of calculating the distance matrix by Euclidean distance, calculate a graph distance
• For every point, find its k-nearest neighbors
• Create a graph where each node represents a point. Edges between nodes mean the nodes are one of each other’s k-nearest neighbors.
Edge length is Euclidean distance.
• Calculate the shortest path between every two nodes. That’s the distance matrix in the original objective function
• Solve the embedding problem
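•One way to run this in R is the vegan package’s isomap (an assumption; the talk doesn’t name an implementation):

```r
# Isomap: Euclidean distances -> kNN graph -> shortest paths -> embedding
library(vegan)
library(ISLR)

X <- scale(Auto[, c("cylinders", "displacement", "horsepower",
                    "weight", "acceleration", "year", "origin")])

d <- dist(X)                       # step 1: pairwise Euclidean distances
emb <- isomap(d, ndim = 2, k = 5)  # steps 2-4: graph, geodesics, 2-D embedding
plot(emb$points[, 1], emb$points[, 2])
```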
Isomap and faces/hands
Thanks for coming!
•The code to create the demos in the talk can be found at my github
•The slides can be found on my website