The Triangle of Statistical Inference: Likelihood
[Diagram: the triangle of statistical inference, linking the Data, the Scientific Model, and the Probability Model, which together support Inference]
An example...
The Data:
xi = measurements of DBH on 50 trees
yi = measurements of crown radius on those trees
The Scientific Model:
yi = a + b·xi + e (a linear relationship, with 2 parameters (a, b) and an error term e, the residuals)
The Probability Model:
e is normally distributed, with E[e] = 0 and variance estimated from the observed variance of the residuals...
So what is likelihood –and what is it good for?
1. Probability based (“inverse probability”).
“a mathematical quantity that appears to be appropriate for measuring our order of preference among different possible populations but does not in fact obey the laws of probability”
-- R.A. Fisher
2. Foundation of theory of statistics.
3. Enables comparison of alternate models.
So what is likelihood –and what is it good for?
Scientific hypotheses cannot be treated as
outcomes of trials (probabilities) because we will
never have the full set of possible outcomes.
However, we can calculate the probability of
obtaining the results, given our model (scientific
hypothesis): P(data | model).
Likelihood is proportional to this probability.
Likelihood is proportional to probability
P(data | hypothesis (θ)) ∝ L(hypothesis | data)

P(data | hypothesis (θ)) = k · L(θ | data)
In plain English: “The likelihood (L) of the set of
parameters (θ) (in the scientific model), given the data (x),
is proportional to the probability of observing the data,
given the parameters...”
{and this probability is something we can calculate, using
the appropriate underlying probability model (i.e. a PDF)}
Parameter values can specify your hypotheses
P(data_i | θ) = k · L(θ | data)

Probability, P(data | θ): the parameter is fixed and the data are variable. What is the probability of observing the data if our model and parameters are correct?

Likelihood, L(θ | data): the parameter is variable and the data are fixed. What is the likelihood of the parameter, given the data?
General Likelihood Function
L(θ | x) = c · g(x | θ)

[Figure: a probability density function (or discrete density function), with the data (x_i) on the x-axis and probability on the y-axis; the parameters in the probability model determine the curve, and the likelihood function is evaluated from it at the data.]
c is a constant and is thus unimportant in comparisons of alternate
hypotheses or models, as long as the data remain constant.
General Likelihood Function
L(\theta \mid x) = \prod_{i=1}^{n} g(x_i \mid \theta)

[Figure: the same probability density function (or discrete density function), now evaluated at each observation x_i; the likelihood is the product of these densities.]
The parameters of the pdf are determined by the data and
by the value of the parameters in the scientific model!!
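To make the product form concrete, here is a minimal R sketch (the data, the normal density g, and the parameter value are all invented for illustration):

# likelihood as a product of densities: L(theta | x) = prod_i g(x_i | theta)
x <- c(4.1, 5.3, 4.8, 5.0)                              # hypothetical observations
g <- function(x, theta) dnorm(x, mean = theta, sd = 1)  # assumed probability model
L <- function(theta) prod(g(x, theta))
L(5.0)                                                  # likelihood of theta = 5 given these data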
Likelihood Axiom
“Within the framework of a statistical model, a
set of data supports one statistical
hypothesis better than another if the likelihood
of the first hypothesis, on the data, exceeds
the likelihood of the second hypothesis.”
(Edwards 1972, p.)
How to derive a likelihood function: Binomial
Event: 10 trees die out of a population of 50.
Question: What is the mortality rate (p)?

g(x \mid p) = \binom{n}{x} p^{x}(1 - p)^{n - x}   (probability density function)

L(p \mid x) = c\, g(x \mid p)   (likelihood)

L(p \mid 10) \propto g(10 \mid p) = p^{10}(1 - p)^{50 - 10}

The most likely parameter value is 10/50 = 0.20.
Likelihood Profile: Binomial
L(p \mid 10) \propto g(10 \mid p) = p^{10}(1 - p)^{50 - 10}

The model (parameter p) is defined by the data!!

[Figure: likelihood profile for the value of the estimated parameter (p), plotted from 0 to 1; the curve peaks at p = 0.2, with a maximum likelihood of roughly 1.4e-11.]
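One way to reproduce this profile in R (a sketch, assuming the event above: 10 deaths out of 50 trees):

p   <- seq(0.01, 0.99, by = 0.01)      # candidate mortality rates
lik <- p^10 * (1 - p)^40               # g(10 | p), up to the binomial constant
# equivalently, with the constant: lik <- dbinom(10, size = 50, prob = p)
p[which.max(lik)]                      # maximum likelihood estimate: 0.2
plot(p, lik, type = "l",
     xlab = "Value of estimated parameter (p)", ylab = "likelihood")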
An example: Can we predict tree fecundity as a
function of tree size?

The Data:
xi = measurements of DBH on 50 trees
yi = counts of seeds produced by those trees

The Scientific Model:
yi = DBH^b + e (exponential relationship, with 1 parameter (b) and an error term (e))

The Probability Model:
Data follow a Poisson distribution, with E[x] and variance both equal to λ

Iterative process

[Diagram: the triangle of statistical inference again, cycling among the Data, the Scientific Model (hypothesis), the Probability Model, and Inference]
1. Pick a value for the parameter in your scientific model, b.
Recall that the scientific model is yi = DBH^b.
2. For each data point, calculate the expected (predicted) value
for that value of b.
3. Calculate the probability of observing what you observed,
given that parameter value and your probability model.
4. Multiply the probabilities of the individual observations.
5. Go back to 1 until you find the maximum likelihood estimate
of parameter b (see the R sketch below).
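A sketch of these five steps in R, using a simple grid search over b (the DBH and seed-count values below are invented for illustration):

dbh   <- c(10, 15, 22, 30, 41)                   # hypothetical DBH measurements
seeds <- c(2, 5, 9, 20, 35)                      # hypothetical seed counts
bvals <- seq(0.1, 2, by = 0.01)                  # step 1: candidate values of b
loglik <- sapply(bvals, function(b) {
  pred <- dbh^b                                  # step 2: predicted value for each tree
  sum(dpois(seeds, lambda = pred, log = TRUE))   # steps 3-4, summed on the log scale
})
bvals[which.max(loglik)]                         # step 5: maximum likelihood estimate of b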
Likelihood Poisson Process
P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}

E[x] = λ
First pass…
Model: yi = DBH^b, with b = 2

P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}

Predicted = 0.0617; Observed = 2

P(X = 2) = \frac{e^{-\mathrm{pred}}\,\mathrm{pred}^{2}}{2!} = 0.0017

[Figure: probability distribution of a Poisson random variable with E[x_1] = 0.0617, plotted against number of seeds (0 to 3); nearly all of the probability mass falls on 0 seeds.]

Do this for all n observations…
Pick a new value of beta...
Model: yi = DBH^b, with b = 0.5

P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}

Predicted = 0.498; Observed = 2

P(X = 2) = \frac{e^{-\mathrm{pred}}\,\mathrm{pred}^{2}}{2!} = 0.075

[Figure: probability distribution of a Poisson random variable with E[x_1] = 0.498, plotted against number of seeds (0 to 6).]

Do this for all n observations…
Probability and Likelihood
1. Multiplying probabilities is not convenient from a
computational point of view.
2. Instead, we take the log of each probability, sum the logs, and
maximize that sum (as illustrated below).
3. This gives us the maximum likelihood estimate of the
parameter.
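For example (arbitrary numbers, just to show the computational point):

p <- rep(0.001, 500)       # 500 small per-observation probabilities
prod(p)                    # the product underflows to 0 in double precision
sum(log(p))                # the sum of logs remains well defined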
Likelihood Profile
Model: yi = DBH^b

[Figure: log-likelihood profile for b, plotted over Beta from 0 to 1.2; the log-likelihood ranges from about -170.5 to -165, and the ML estimate is marked at the peak of the curve.]
Model comparison

The Data:
xi = measurements of DBH on 50 trees
yi = counts of seeds produced by trees

The Scientific Models:
yi = DBH^b + e (exponential relationship, with 1 parameter (b))
OR
yi = γ·DBH + e (linear relationship, with 1 parameter (γ))

The Probability Model:
Data follow a Poisson distribution, with E[x] and variance both equal to λ
Model comparison

The Data:
xi = measurements of DBH on 50 trees
yi = counts of seeds produced by trees

The Scientific Model:
yi = DBH^b + e (exponential relationship, with 1 parameter (b))

The Probability Models:
Data follow a Poisson distribution, with E[x] and variance both equal to λ
OR
Data follow a negative binomial distribution with E[x] = m and clumping parameter k (the variance is determined by m and the estimated k).
Determination of appropriate likelihood function
FIRST PRINCIPLES
1. Proportions: binomial
2. Several categories: multinomial
3. Count events: Poisson, negative binomial
4. Continuous data, additive processes: normal
5. Quantities arising from multiplicative processes: lognormal, gamma

EMPIRICAL
1. Examine residuals.
2. Test different probability distributions for the model errors.

Probability models can be thought of as competing hypotheses in
exactly the same way that different parameter values (structural
models) are competing hypotheses.
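As a sketch of this idea in R: hold the scientific model fixed and compare a Poisson and a negative binomial probability model on the same counts (the counts, the stand-in predictions, and the value of k below are all invented):

counts <- c(0, 2, 1, 7, 0, 12, 3, 0, 5, 9)           # hypothetical seed counts
pred   <- rep(mean(counts), length(counts))          # stand-in predictions from a scientific model
ll_pois <- sum(dpois(counts, lambda = pred, log = TRUE))
ll_nb   <- sum(dnbinom(counts, mu = pred, size = 2, log = TRUE))  # size = clumping parameter k
c(poisson = ll_pois, negbinom = ll_nb)               # compare the two log-likelihoods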
Likelihood functions: An aside about logarithms
Taking the logarithm base a of a number gives the power to which
a must be raised to obtain that number. Example: \log_{10}(1000) = 3, because 10^{3} = 1000.

Basic Log Operations

\log(ab) = \log a + \log b

\log\!\left(\frac{a}{b}\right) = \log a - \log b

\log(b^{a}) = a\,\log b

\log_{a}(a) = 1
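These identities are easy to verify numerically, for example in R:

a <- 7; b <- 3
log(a * b); log(a) + log(b)    # equal
log(a / b); log(a) - log(b)    # equal
log(b^a);   a * log(b)         # equal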
Poisson Likelihood Function
P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}   (discrete density function)

L(\lambda \mid x) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}   (likelihood)

\log L(\lambda \mid x) = \sum_{i=1}^{n}\left[-\lambda + x_i\ln\lambda - \ln(x_i!)\right]

E[X] = \mathrm{Var}[X] = \lambda
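A quick sketch checking this log-likelihood against R's built-in Poisson density (the data and λ are arbitrary):

x <- c(0, 3, 1, 5); lambda <- 2
sum(-lambda + x * log(lambda) - lfactorial(x))   # log-likelihood from the formula above
sum(dpois(x, lambda, log = TRUE))                # same value via dpois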
Negative Binomial Distribution Likelihood Function
\Pr(X = n) = \frac{\Gamma(k + n)}{\Gamma(k)\, n!}\left(\frac{m}{m + k}\right)^{n}\left(1 + \frac{m}{k}\right)^{-k}   (discrete density function)

L = \prod_{i=1}^{N} L_i, \quad L_i = \frac{\Gamma(n_i + k)}{\Gamma(n_i + 1)\,\Gamma(k)}\cdot\frac{m_i^{\,n_i}\,k^{k}}{(m_i + k)^{(n_i + k)}}   (likelihood)

k is an estimated parameter!!

\log L(x \mid \theta) = N\left[k\ln k - \ln\Gamma(k)\right] + \sum_{i=1}^{N}\left[\ln\Gamma(n_i + k) + n_i\ln m_i - \ln\Gamma(n_i + 1) - (n_i + k)\ln(m_i + k)\right]

E[X] = m, \quad \mathrm{Var}[X] = m + \frac{m^{2}}{k}
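In R, this log-likelihood can be written from the formula or with dnbinom, using the mean (mu) and clumping (size = k) parameterization; a sketch with invented values and a constant mean m:

n <- c(0, 4, 1, 9); m <- 2.5; k <- 1.2
sum(lgamma(n + k) - lgamma(n + 1) - lgamma(k) +
    n * log(m) + k * log(k) - (n + k) * log(m + k))   # formula above
sum(dnbinom(n, mu = m, size = k, log = TRUE))         # same value via dnbinom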
Normal Distribution Likelihood Function
f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\left(-\frac{(x - \mu)^{2}}{2\sigma^{2}}\right)   (prob. density function)

L(\mu, \sigma \mid x) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\left(-\frac{(x_i - \mu)^{2}}{2\sigma^{2}}\right)   (likelihood)

\log L(\mu, \sigma \mid x) = -n\left[\ln\sigma + \tfrac{1}{2}\ln(2\pi)\right] - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i - \mu)^{2}

E[x] = \mu, \quad \mathrm{Var}[x] = \sigma^{2}
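A sketch checking this against dnorm in R (the data, μ, and σ are arbitrary):

x <- c(4.2, 5.1, 3.8, 6.0); mu <- 5; sigma <- 1.3
n <- length(x)
-n * (log(sigma) + 0.5 * log(2 * pi)) - sum((x - mu)^2) / (2 * sigma^2)   # formula above
sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))                          # same value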
Lognormal Distribution Likelihood Function
f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(\ln x - \mu)^{2}}{2\sigma^{2}}\right)   (prob. density function)

L(\mu, \sigma \mid x) = \prod_{i=1}^{n}\frac{1}{x_i\,\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(\ln x_i - \mu)^{2}}{2\sigma^{2}}\right)   (likelihood)

\log L(\theta, \sigma \mid x) = -n\left[\tfrac{1}{2}\ln(2\pi) + \ln\sigma\right] - \sum_{i=1}^{n}\left[\frac{(\ln x_i - \ln\hat{x}_i)^{2}}{2\sigma^{2}} + \ln x_i\right]

where \hat{x}_i is the predicted value from the scientific model (so \ln\hat{x}_i plays the role of \mu), and

\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n}\left[\ln x_i - \ln\hat{x}_i\right]^{2}
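In R, dlnorm takes the log-scale mean and standard deviation directly; a sketch, assuming the predicted values x̂_i supply the log-scale mean (all values invented):

x    <- c(1.4, 3.2, 0.9, 2.5)                    # hypothetical observations (positive)
xhat <- c(1.2, 2.8, 1.1, 2.2)                    # hypothetical predicted values
sigma <- sqrt(mean((log(x) - log(xhat))^2))      # sigma estimated from log-scale residuals
sum(dlnorm(x, meanlog = log(xhat), sdlog = sigma, log = TRUE))   # log-likelihood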
Gamma Distribution Likelihood Function
f(x) = \frac{1}{s^{a}\,\Gamma(a)}\, x^{a-1} e^{-x/s}   (prob. density function)

E[x] = a s, \quad \mathrm{Var}[x] = a s^{2}

a = shape parameter, s = scale parameter

\log L = \sum_{i=1}^{n}\left[-\ln\Gamma(a) - a\ln s + (a - 1)\ln x_i - \frac{x_i}{s}\right]
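A sketch checking this against dgamma in R (the shape a and scale s are arbitrary here):

x <- c(0.8, 2.3, 1.1, 4.0); a <- 2; s <- 1.5
sum(-lgamma(a) - a * log(s) + (a - 1) * log(x) - x / s)   # formula above
sum(dgamma(x, shape = a, scale = s, log = TRUE))          # same value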
Exponential Distribution Likelihood Function
f(x) = \lambda e^{-\lambda x}   (prob. density function)

L = \prod_{i=1}^{n}\lambda e^{-\lambda x_i}   (likelihood)

\log L(x \mid \lambda) = n\ln\lambda - \lambda\sum_{i=1}^{n} x_i

E[x] = \frac{1}{\lambda}, \quad \mathrm{Var}[x] = \frac{1}{\lambda^{2}}
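Correspondingly in R (λ and the data are arbitrary):

x <- c(0.3, 1.2, 0.7, 2.5); lambda <- 0.8
length(x) * log(lambda) - lambda * sum(x)    # formula above
sum(dexp(x, rate = lambda, log = TRUE))      # same value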
Evaluating the strength of evidence for the MLE
Now that you have an MLE, how should you evaluate it?
Two purposes of support/confidence intervals
• Measure of support for alternate parameter
estimates.
• Help with fitting when something goes
wrong.
Methods of calculating support intervals
• Bootstrapping
• Likelihood curves and profiles
Bootstrapping
• Resample the data with replacement and
record the number of times that the
parameter estimate fell within an interval.
• Frequentist approach: If I sampled my data
a large number of times, what would my
confidence in the estimate be?
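A minimal sketch of this idea in R, using the earlier binomial example (10 deaths among 50 trees) and the sample proportion as the estimator:

set.seed(1)
deaths <- c(rep(1, 10), rep(0, 40))                              # the observed 10 deaths out of 50
boot_p <- replicate(5000, mean(sample(deaths, replace = TRUE)))  # resample with replacement
quantile(boot_p, c(0.025, 0.975))                                # bootstrap 95% interval for p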
General method
• Draw the likelihood curve (one parameter), surface (two parameters), or n-dimensional space (n parameters).
• Figure out how much the likelihood
changes as the parameter of interest moves
away from the MLE.
Strength of evidence for particular
parameter estimates – “Support”
Log-likelihood = “Support” (Edwards 1992)
• Likelihood provides an objective measure
of the strength of evidence for different
parameter estimates...
[Figure: log-likelihood plotted against the parameter estimate (2.0 to 2.8); the log-likelihood ranges from about -155 to -147.]
Asymptotic vs. Simultaneous
M-Unit Support Limits
• Asymptotic:
– Hold all other parameters at their MLE values, and
systematically vary the remaining parameter until
likelihood declines by a chosen amount (m)
What should “m” be? 1.92 is a good number, and is roughly analogous to a 95% CI.

[Figure: the same log-likelihood profile, with the maximum likelihood estimate marked at the peak and a 2-unit support interval spanning the parameter values whose log-likelihood is within 2 units of the maximum.]
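A sketch of the asymptotic calculation for the binomial mortality example: scan the parameter and keep the values whose log-likelihood lies within 1.92 units of the maximum:

p  <- seq(0.001, 0.999, by = 0.001)
ll <- dbinom(10, size = 50, prob = p, log = TRUE)   # log-likelihood profile
keep <- ll >= max(ll) - 1.92                        # within m = 1.92 units of the peak
range(p[keep])                                      # approximate 95% support interval for p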
An aside on the Likelihood Ratio Test
• The likelihood ratio statistic R, twice the difference in
log-likelihoods of two nested models, follows a chi-square
distribution with degrees of freedom equal to the difference
in the number of parameters between models A and B.

R = 2\left[\log L(Y \mid M_A) - \log L(Y \mid M_B)\right]
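For example (with made-up log-likelihoods): if a 2-parameter model A has log-likelihood -165.2 and a nested 1-parameter model B has -168.9, the test in R is:

R_stat <- 2 * (-165.2 - (-168.9))                 # R = 2 * [logL_A - logL_B] = 7.4
pchisq(R_stat, df = 2 - 1, lower.tail = FALSE)    # p-value of the likelihood ratio test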
Asymptotic vs. Simultaneous
M-Unit Support Limits
• Simultaneous:
– Resampling method: draw a very large number of
random sets of parameters and calculate log-likelihood.
M-unit simultaneous support limits for parameter xi are
the upper and lower limits that don’t differ by more
than m units of support.
• Set the focal parameter to a range of values and
for each value optimize the likelihood for all the
other parameters:
In practice, it can require an enormous number of iterations to
do this if there are more than a few parameters
Asymptotic vs. Simultaneous Support Limits
A hypothetical likelihood surface for 2 parameters

[Figure: contours of the likelihood surface over Parameter 1 and Parameter 2, with the contour marking a 2-unit drop in support; the asymptotic 2-unit support limits for P1 (with Parameter 2 held at its MLE) are narrower than the simultaneous 2-unit support limits for P1.]
Other measures of strength of evidence for
different parameter estimates
• Edwards (1992; Chapter 5)
– Various measures of the “shape” of the
likelihood surface in the vicinity of the MLE...
How pointed is the peak?...
Evaluating Support for Parameter
Estimates
• Traditional confidence intervals and standard errors of
the parameter estimates can be generated from the
Hessian matrix
– Hessian = matrix of second partial derivatives of
the likelihood function with respect to parameters,
evaluated at the maximum likelihood estimates
– Also called the “Information Matrix” by Fisher
– Provides a measure of the steepness of the
likelihood surface in the region of the optimum
– Evaluated at MLE points it is the observed
information matrix
– Can be generated in R using optim
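A hedged sketch of the mechanics with optim, using an invented normal model (this is not the model behind the output on the next slide); because this version minimizes the negative log-likelihood, the variance matrix is solve(res$hessian), with no sign flip:

x <- rnorm(50, mean = 5, sd = 2)                     # hypothetical data
negll <- function(par) -sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))
res <- optim(c(mean(x), sd(x)), negll, method = "L-BFGS-B",
             lower = c(-Inf, 1e-6), hessian = TRUE)  # Hessian of the negative log-likelihood
sqrt(diag(solve(res$hessian)))                       # approximate standard errors for mean and sd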
An example from R
• The Hessian matrix (when maximizing a log-likelihood) is a
numerical approximation of Fisher's Information Matrix (up to
sign: the observed information is the negative of the matrix of
second partial derivatives of the log-likelihood), evaluated at the
point of the maximum likelihood estimates. Thus, it is a measure
of the steepness of the drop in the likelihood surface as you move
away from the MLE.
> res$hessian
            a           b         sd
a    -150.182   -2758.360     -0.202
b   -2758.360  -67984.416     -5.926
sd     -0.201      -5.925   -299.422
The Hessian CI
• Now invert the negative of the Hessian matrix to get the
matrix of parameter variance and covariance
> solve(-1*res$hessian)
               a              b             sd
a   2.613229e-02  -1.060277e-03   3.370998e-06
b  -1.060277e-03   5.772835e-05  -4.278866e-07
sd  3.370998e-06  -4.278866e-07   3.339775e-03
• The square roots of the diagonals of the inverted negative
Hessian are the standard errors
> sqrt(diag(solve(-1*res$hessian)))
      a        b       sd
 0.1616 0.007597  0.05779

(and 1.96 × S.E. gives a 95% CI)
• Are we reverting to a frequentist framework?
Some references
Edwards, A.W.F. 1972. Likelihood. Cambridge University Press.
Feller, W. 1968. An Introduction to Probability Theory and Its Applications. Wiley & Sons.