Parametric Methods of Estimating Probabilities
Chapter 4, Introduction to Machine Learning, Ethem Alpaydin

Introduction
• We all know how optimal decisions are made when uncertainty is modeled using probabilities.
• In this talk, we will see how to estimate these probabilities from a given training set.
• We discuss only the parametric approach, for both classification and regression.

1. Maximum Likelihood Estimation
• The best way to explain MLE is this: you only get to see what nature wants you to see. The things you see are facts, and these facts have an underlying process that generated them. That process is hidden and unknown, and needs to be discovered. The question is then: given the observed facts, what is the likelihood that process P1 generated them? What is the likelihood that process P2 generated them? And so on. One of these likelihoods will be the maximum of them all; MLE is the procedure that finds the parameters attaining that maximum likelihood. (Short code sketches of the estimators below appear after section 2.)

1.a. Bernoulli Density
• In the Bernoulli distribution there are two outcomes: an event occurs or it does not.
• Example: the probability of getting heads (a success) when flipping a fair coin is 0.5. The probability of failure is $1 - p$, one minus the probability of success, which also equals 0.5 for a fair coin toss. Here N = 1, i.e., a single trial.
• The probability mass function for this distribution is written as
  $P(x) = p^{x}(1 - p)^{1 - x}, \quad x \in \{0, 1\}.$

1.b. Multinomial Density
• In the multinomial distribution there are more than two outcomes; it is an extension of the Bernoulli density.
• Suppose we have an experiment with n independent trials, where each trial produces exactly one of the events E1, E2, ..., Ek (i.e., these events are mutually exclusive and collectively exhaustive), and on each trial Ej occurs with probability $\pi_j$, j = 1, 2, ..., k.
• Notice that $\pi_1 + \pi_2 + \cdots + \pi_k = 1$: the probabilities, regardless of how many possible outcomes there are, always sum to 1.
• Define the random variables: X1 = the number of trials in which E1 occurs, X2 = the number of trials in which E2 occurs, ..., Xk = the number of trials in which Ek occurs.
• Then X = (X1, X2, ..., Xk) is said to have a multinomial distribution with index n and parameter $\pi = (\pi_1, \pi_2, \ldots, \pi_k)$. In most problems, n is regarded as fixed and known.

1.c. Gaussian Density
• X is normally distributed with mean E[X] = μ and variance Var(X) = σ², denoted N(μ, σ²), if its density function is
  $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty.$
• Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.

2. Bayes Estimator
• For point or interval estimation of a parameter θ in a model M based on data y, Bayesian inference is based on
  $p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta),$
  where p(θ) is the prior density for the parameter, p(θ | y) is the posterior density for the parameter, and p(y | θ) is the statistical model, or likelihood.

Steps in calculating the Bayes estimate
• Construct the prior.
• Find the likelihood.
• Modify the prior using the likelihood to get the posterior.

Definitions
• Prior: p(θ) expresses our opinion about θ before the data are factored in.
• Posterior: p(θ | y) expresses the conditional probability of θ after the evidence is taken into account.
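A minimal sketch (not from the slides) of the closed-form maximum-likelihood estimates for sections 1.a and 1.b. For Bernoulli data the MLE is the sample proportion of successes, and for multinomial counts it is the vector of observed fractions; these closed forms are standard, but the function names and sample data are illustrative.

```python
import numpy as np

def bernoulli_mle(x):
    # Maximizing log L(p) = sum(x) log p + (N - sum(x)) log(1 - p)
    # gives p_hat = sum(x) / N, the sample proportion of successes.
    return float(np.mean(x))

def multinomial_mle(counts):
    # pi_hat_j = X_j / n: the observed fraction of trials where E_j occurred.
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

coin_flips = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # 1 = heads, 0 = tails
print(bernoulli_mle(coin_flips))             # 0.6
print(multinomial_mle([30, 50, 20]))         # [0.3 0.5 0.2] for k = 3 events
```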
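For the Gaussian density of section 1.c, maximizing the log-likelihood gives the sample mean and the 1/N sample variance. A sketch assuming NumPy, with illustrative data drawn from a known normal so the estimates can be checked against the truth:

```python
import numpy as np

def gaussian_mle(x):
    # Maximizing the N(mu, sigma^2) log-likelihood gives the sample mean
    # and the 1/N sample variance (note: 1/N, not the unbiased 1/(N-1)).
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    var_hat = ((x - mu_hat) ** 2).mean()
    return mu_hat, var_hat

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.5, size=1000)  # true mu = 2.0, sigma^2 = 2.25
mu_hat, var_hat = gaussian_mle(sample)
print(mu_hat, var_hat)  # estimates should land close to 2.0 and 2.25
```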
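A sketch of the three Bayes-estimation steps from section 2, using one common concrete choice rather than the only one: a Beta prior on a Bernoulli parameter. Because the Beta is conjugate to the Bernoulli likelihood, the posterior is again a Beta, with the observed counts added to the prior's pseudo-counts. The prior values a = b = 2 are illustrative assumptions.

```python
import numpy as np

# Step 1: construct the prior. Beta(a, b) over the Bernoulli parameter p;
# a = b = 2 encodes a mild prior belief that the coin is roughly fair.
a, b = 2.0, 2.0

# Step 2: find the likelihood. For Bernoulli data it is summarized by counts.
x = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
successes = int(x.sum())
failures = len(x) - successes

# Step 3: modify the prior with the likelihood to get the posterior.
# Conjugacy makes this a simple update: posterior = Beta(a + k, b + N - k).
a_post, b_post = a + successes, b + failures

posterior_mean = a_post / (a_post + b_post)  # Bayes estimate under squared loss
print(posterior_mean)  # 0.571..., pulled from the MLE of 0.6 toward the prior 0.5
```

Note how the posterior mean shrinks the MLE toward the prior guess of 0.5; with more data, the counts dominate and the two estimates converge.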
Example: Comparison between the two estimators
• MLE can be less reliable than Bayes estimation, because it ignores prior information.
• Imagine you are a doctor. You have a patient who shows an odd set of symptoms. You look in your doctor book and decide the disease could be either a common cold or lupus. Your doctor book tells you that if a patient has lupus, the probability that he will show these symptoms is 90%. It also states that if the patient has a common cold, the probability that he will show these symptoms is only 10%. Which disease is more likely?
• There are two approaches to take. Using maximum likelihood estimation, you would declare: "The patient has lupus. Lupus is the disease which maximizes the likelihood of presenting these symptoms." However, a cannier doctor would remember that lupus is very rare: the prior probability of anyone having lupus is very low (~5 per 100K people) compared to the common cold, which is common. Using a Bayesian estimate, which weighs each likelihood by its prior, you should decide that the patient is more likely to have a common cold than lupus (see the sketch below).
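The example's arithmetic, sketched in code. The 90% and 10% likelihoods and the ~5-per-100K lupus prior come from the text above; the 10% prior for a common cold is an assumed stand-in for "common".

```python
# Likelihoods are from the example; the cold prior is an assumed stand-in.
p_symptoms_given_lupus = 0.90
p_symptoms_given_cold = 0.10
p_lupus = 5 / 100_000  # ~5 per 100K people, as in the example
p_cold = 0.10          # assumed: "common", orders of magnitude above lupus

# MLE compares likelihoods alone: 0.90 > 0.10, so it declares lupus.
# Bayes compares unnormalized posteriors p(symptoms | d) * p(d).
post_lupus = p_symptoms_given_lupus * p_lupus  # 4.5e-05
post_cold = p_symptoms_given_cold * p_cold     # 1.0e-02
print("more likely:", "cold" if post_cold > post_lupus else "lupus")
```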
Thank you