Parametric Methods of Estimating Probabilities
Chapter 4 - Introduction to Machine Learning, Ethem Alpaydin
Introduction
• We all know how optimal decisions are made when uncertainty is
modeled using probabilities.
• In this talk, we will see how we can estimate these probabilities
from a given training set.
• We restrict ourselves to the parametric approach, covering both classification and regression.
1. Maximum Likelihood Estimation
• The best way to explain MLE is this: you only get to see what nature lets you see. The things you see are facts, and these facts were generated by some underlying process. That process is hidden and unknown, and needs to be discovered. The question is then: given the observed facts, what is the likelihood that process P1 generated them? What is the likelihood that process P2 generated them? And so on. One of these likelihoods will be the maximum of all, and MLE picks out the process, i.e. the parameter values, that attains it.
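In symbols (the standard textbook formulation, restating the idea above): given an i.i.d. sample $\mathcal{X} = \{x^t\}_{t=1}^{N}$ drawn from a density $p(x \mid \theta)$, the log likelihood is
$\mathcal{L}(\theta \mid \mathcal{X}) = \log \prod_{t=1}^{N} p(x^t \mid \theta) = \sum_{t=1}^{N} \log p(x^t \mid \theta)$
and the maximum likelihood estimate is the parameter value that maximizes it:
$\hat{\theta} = \arg\max_{\theta} \mathcal{L}(\theta \mid \mathcal{X})$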
1.a. Bernoulli Density
• In the Bernoulli distribution, there are two outcomes: an event occurs or it does not.
• Example: the probability of getting heads (a success) when flipping a fair coin is 0.5. The probability of "failure" is $1-p$ (one minus the probability of success), which also equals 0.5 for a fair coin toss. Here $N = 1$, i.e. a single trial.
• The probability density function (pdf) for this distribution can be written as
$P(x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}$
where $p$ is the probability of success.
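As a quick sketch (not from the slides; it assumes NumPy is available), the maximum likelihood estimate of the Bernoulli parameter $p$ works out to the sample mean of the 0/1 outcomes:

```python
import numpy as np

# Simulated coin flips (1 = heads/success, 0 = tails/failure).
rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=0.5, size=1000)

# MLE of the Bernoulli parameter: the fraction of successes.
p_hat = x.mean()
print(f"p_hat = {p_hat:.3f}")  # close to the true p = 0.5
```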
1.b. Multinomial Density
• In the multinomial distribution, there are more than two possible outcomes; it is an extension of the Bernoulli density.
• Suppose that we have an experiment with n independent trials, where each trial produces exactly one
of the events E1, E2, . . ., Ek (i.e. these events are mutually exclusive and collectively exhaustive), and
on each trial, Ej occurs with probability πj , j = 1, 2, . . . , k.
• Notice that π1 + π2 + · · · + πk = 1. The probabilities, regardless of how many possible outcomes, will
always sum to 1
• Let's define the random variables:
X1 = number of trials in which E1 occurs,
X2 = number of trials in which E2 occurs,
...
Xk = number of trials in which Ek occurs.
• Then X = (X1, X2, . . ., Xk) is said to have a multinomial distribution with index n and parameter π =
(π1, π2, . . . , πk). In most problems, n is regarded as fixed and known.
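A brief sketch of estimation in this model (a standard result, stated here as an aside rather than taken from the slides): the MLE of each πj is the relative frequency Xj / n, as the code below illustrates.

```python
import numpy as np

# Simulated die rolls: n trials, k = 6 mutually exclusive outcomes.
rng = np.random.default_rng(0)
true_pi = np.full(6, 1 / 6)
counts = rng.multinomial(n=600, pvals=true_pi)  # X_1, ..., X_k

# MLE of the multinomial parameters: relative frequencies.
pi_hat = counts / counts.sum()
print(pi_hat)  # each entry close to 1/6
```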
1.c. Gaussian Density
• X is normally distributed with mean $E[X] = \mu$ and variance $\mathrm{Var}(X) = \sigma^2$, denoted $N(\mu, \sigma^2)$, if its density function is
$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \quad -\infty < x < \infty$
• Normal distributions are important in statistics and are often used
in the natural and social sciences to represent real-valued random
variables whose distributions are not known.
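As with the other densities, a short sketch (a standard result, assuming NumPy): the maximum likelihood estimates of μ and σ² are the sample mean and the sample variance with an N (not N − 1) denominator:

```python
import numpy as np

# Simulated sample from N(mu = 2, sigma^2 = 9).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1000)

# MLE of the Gaussian parameters.
mu_hat = x.mean()
var_hat = ((x - mu_hat) ** 2).mean()  # divides by N (the MLE), not N - 1
print(f"mu_hat = {mu_hat:.3f}, var_hat = {var_hat:.3f}")
```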
2. Bayes Estimator
• For point or interval estimation of a parameter θ in a model M based on data y, Bayesian inference is based on
$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta)$
where
$p(\theta)$ is the prior density for the parameter,
$p(\theta \mid y)$ is the posterior density for the parameter, and
$p(y \mid \theta)$ is the statistical model, or likelihood.
Steps in calculating a Bayes estimate
• Constructing the prior
• Finding the likelihood
• Modifying the prior using the likelihood to get the posterior (illustrated in the sketch below)
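These three steps can be made concrete with a conjugate example (my choice of illustration, not one given on the slides): a Beta(α, β) prior on the success probability of a Bernoulli model. The posterior is again a Beta, and the Bayes estimator under squared-error loss is the posterior mean.

```python
import numpy as np

# Step 1: construct the prior -- Beta(alpha, beta) on the coin's success probability.
alpha, beta = 2.0, 2.0  # assumed hyperparameters, chosen for illustration

# Step 2: find the likelihood -- n Bernoulli trials with h successes.
rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=0.7, size=50)
h, n = x.sum(), x.size

# Step 3: modify the prior using the likelihood to get the posterior.
# Beta is conjugate to the Bernoulli, so the posterior is Beta(alpha + h, beta + n - h).
post_alpha, post_beta = alpha + h, beta + n - h

# Bayes estimator (posterior mean) vs. the MLE (sample mean).
bayes_est = post_alpha / (post_alpha + post_beta)
mle_est = h / n
print(f"Bayes estimate = {bayes_est:.3f}, MLE = {mle_est:.3f}")
```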
Definitions
• Prior: $p(\theta)$, expresses our opinion of θ before the data is factored in.
• Posterior: $p(\theta \mid y)$, expresses the conditional probability of θ after the evidence is taken into account.
Comparison between the two estimators
• MLE can be less reliable than Bayes estimation, because it ignores prior information about the parameter.
• Imagine you are a doctor. You have a patient who shows an odd set of symptoms. You look in your
doctor book and decide the disease could be either a common cold or lupus.
Your doctor book tells you that if a patient has lupus then the probability that he will show these
symptoms is 90%.
It also states that if the patient has a common cold then the probability that he will show these
symptoms is only 10%.
Which disease is more likely?
Well, there are two approaches to take. If you used maximum likelihood estimation you would
declare, "The patient has lupus. Lupus is the disease which maximizes the likelihood of presenting
these symptoms."
However, a cannier doctor would remember that lupus is very rare. This means that the prior probability of anyone having lupus is very low (~5 per 100K people) compared to the common cold, which is common. Using a Bayesian estimate, you should decide that the patient is more likely to have a common cold than lupus.
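To make the comparison concrete (the likelihoods and the lupus prevalence come from the example above; the 5% cold prior is an assumed number, chosen only for illustration):

```python
# Likelihoods from the doctor-book example above.
p_sym_given_lupus = 0.90
p_sym_given_cold = 0.10

# Priors: lupus prevalence ~5 per 100K (from the example); the prior
# probability of having a cold at any given moment is ASSUMED to be 5%.
p_lupus = 5 / 100_000
p_cold = 0.05

# Unnormalized posteriors: p(disease | symptoms) is proportional to
# p(symptoms | disease) * p(disease).
post_lupus = p_sym_given_lupus * p_lupus
post_cold = p_sym_given_cold * p_cold
print(f"lupus: {post_lupus:.6f}, cold: {post_cold:.6f}")
# MLE would pick lupus (0.90 > 0.10); the Bayesian comparison favors the cold.
```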
Thank you