Probabilistic Modeling and Joint Distribution Model

Haluk Madencioglu
Elements of Probability Theory
Introduction
Concerned with the analysis of random phenomena
Originated from gambling & games
Uses ideas from counting, combinatorics and measure theory
Uses mathematical abstractions of non-deterministic events
Continuous probability theory deals with events that occur in a continuous sample space
Discrete probability theory deals with events that occur in countable sample spaces
Events: sets of outcomes of an experiment, i.e. subsets of the sample space
Axioms of Probability
Nonnegativity: 0 ≤ P(E) ≤ 1
Additivity: for mutually exclusive events E1, E2, …, En: P(E1 ∪ E2 ∪ … ∪ En) = ∑i P(Ei)
Normalization (unit measure): P(Ω) = 1 and P(∅) = 0, where Ω is the universe (sample space)
Some consequences:
P(Ω \ E) = 1 − P(E)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
P(A \ B) = P(A) − P(B) if B ⊆ A
Conditional probability
Conditional probability: P(A|B) = P(A, B) / P(B)
Bayes rule: P(A|B) = P(B|A) · P(A) / P(B)
Independence condition: P(A, B) = P(A) · P(B)
Mutually exclusive events: P(A, B) = 0, so that P(A ∪ B) = P(A) + P(B) and P(A \ B) = P(A)
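As a quick illustration, a minimal Python sketch of Bayes rule; the event names and numbers are illustrative, not from the slides:

```python
# Bayes rule: P(A|B) = P(B|A) * P(A) / P(B)
# Illustrative events: A = "message is spam", B = "subject contains 'free'".
p_a = 0.3              # prior P(A)
p_b_given_a = 0.6      # likelihood P(B|A)
p_b_given_not_a = 0.1  # P(B|not A)

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")  # 0.18 / 0.25 = 0.720
```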
Random Variables
A variable; more precisely, a function mapping the sample space of a random process to values
Values can be discrete or continuous
Each value (or range of values) is assigned a probability
Discrete example: a fair coin toss
X = 1 if heads, 0 if tails
Or a fair die roll: X = the number shown on the die
Continuous example: a spinner
The outcome can be any real number in [0, 2π)
Any specific value has zero probability
So we use ranges instead of single points
E.g. a value in [0, π/2] has probability 1/4
For discrete random variables we use a probability mass function:
PX(x) = 1/2 if x = 0, 1/2 if x = 1, 0 otherwise
Notice the use of uppercase for the random variable and lowercase for the argument of the mass function
Cumulative distribution function (CDF): FX(x) = P(X ≤ x)
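A minimal sketch of this PMF and its CDF in Python, using the fair-coin model above:

```python
# PMF of the fair coin (X = 0 tails, X = 1 heads) and its CDF F_X(x) = P(X <= x).
def pmf(x):
    return 0.5 if x in (0, 1) else 0.0

def cdf(x):
    # Sum the PMF over all support points <= x.
    return sum(pmf(k) for k in (0, 1) if k <= x)

print(pmf(1))    # 0.5
print(cdf(0))    # 0.5
print(cdf(1.5))  # 1.0
```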
For continuous random variables we use a probability density function:
P(a ≤ X ≤ b) = ∫_a^b p(x) dx
so that the CDF becomes
FX(x) = ∫_{−∞}^x p(u) du
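A sketch of the density view using the spinner example above; the trapezoidal integration is an illustrative numeric stand-in for the integral:

```python
import math

# Density of the spinner: uniform on [0, 2*pi).
def p(x):
    return 1.0 / (2.0 * math.pi) if 0.0 <= x < 2.0 * math.pi else 0.0

# P(a <= X <= b) = integral of p over [a, b], via the trapezoidal rule.
def prob(a, b, steps=10_000):
    h = (b - a) / steps
    total = 0.5 * (p(a) + p(b)) + sum(p(a + i * h) for i in range(1, steps))
    return total * h

print(prob(0.0, math.pi / 2))  # ~0.25, matching the spinner slide
```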
Well Known Distributions
Discrete uniform distribution: each of the n possible outcomes is equally likely, P(X = k) = 1/n (e.g. the fair die above)
Binomial distribution: the number of successes in n independent trials, each with success probability p: P(X = k) = C(n, k) · p^k · (1 − p)^(n−k)
Special case: n = 1 → Bernoulli distribution
Poisson distribution: the number of events occurring in a fixed interval when events occur at a known average rate λ, independently of the time since the last event: P(X = k) = λ^k · e^(−λ) / k!
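A sketch of the PMFs named above, written out with the standard formulas in plain standard-library Python:

```python
import math

def uniform_pmf(k, n):
    # Discrete uniform: each of n outcomes equally likely.
    return 1.0 / n

def binomial_pmf(k, n, p):
    # Number of successes in n independent trials with success probability p.
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    # Number of events at average rate lam, independent of the last event.
    return lam**k * math.exp(-lam) / math.factorial(k)

print(binomial_pmf(1, 1, 0.3))  # n = 1: the Bernoulli special case, P(X=1) = 0.3
print(poisson_pmf(2, 1.5))      # ~0.251
```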
Expected Value and Variance
Expected value: the probability-weighted average of the possible outcomes: E[X] = ∑_x x · P(X = x)
Variance: the expected value of the squared deviation of the random variable from its expected value: Var(X) = E[(X − E[X])²]
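For instance, both quantities for a fair six-sided die, computed directly from the definitions:

```python
# Expected value and variance of a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
p = 1.0 / len(outcomes)

mean = sum(x * p for x in outcomes)               # E[X] = 3.5
var = sum((x - mean) ** 2 * p for x in outcomes)  # E[(X - E[X])^2] ~ 2.917
print(mean, var)
```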
Joint Distributions
More than one random variable on the same probability space (universe)
Events are defined in terms of all the variables
Called a multivariate distribution (bivariate if two variables are involved)
Recalling the definition of conditional probability, the conditional distribution is p(x | y) = p(x, y) / p(y)
Probabilistic Modeling
Joint Distributions
Similar to probabilities of events, if the variables are independent: p(x, y) = p(x) · p(y)
Continuous distribution case: a joint density p(x, y) takes the place of the mass function
Marginal distributions: p(x) = ∑_y p(x, y), or ∫ p(x, y) dy in the continuous case
If the variables are independent, the joint reduces to a simple product of the marginals
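A sketch of a bivariate table with its marginal and conditional distributions; the numbers are illustrative:

```python
# Joint distribution of two binary variables as a lookup table.
joint = {
    ("Y", "Y"): 0.2, ("Y", "N"): 0.1,
    ("N", "Y"): 0.3, ("N", "N"): 0.4,
}

def marginal_x(x):
    # p(x) = sum over y of p(x, y)
    return sum(p for (xv, _), p in joint.items() if xv == x)

def conditional_x_given_y(x, y):
    # p(x|y) = p(x, y) / p(y)
    p_y = sum(p for (_, yv), p in joint.items() if yv == y)
    return joint[(x, y)] / p_y

print(marginal_x("Y"))                  # 0.3
print(conditional_x_given_y("Y", "Y"))  # 0.2 / 0.5 = 0.4
```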
Random Configurations
In general we have a set of n random variables: V = (V1, V2, …, Vn)
with possible outcomes for each variable: {x1, x2, …, xm}
A configuration is a vector x in which a value is assigned to each variable:
x = (x1, x2, …, xn)
(CSCI 6509 Notes, Fall 2009, Faculty of Computer Science, Dalhousie University)
In modeling we assume a sequence of configurations x(1), …, x(t):
x(1) = (x11, x12, …, x1n)
x(2) = (x21, x22, …, x2n)
…
x(t) = (xt1, xt2, …, xtn)
Here we assume a fixed number n of components in each configuration, and the xij are values from a finite set
NLP uses probabilistic modeling as a framework for solving problems
Computational tasks:
Representation of models
Simulation: generating random configurations
Evaluation: computing the probability of a complete configuration
Marginalization: computing the probability of a partial configuration
Conditioning: computing the conditional probability of a completion given a partial observation
Completion: finding the most probable completion of a partial observation
Learning: parameter estimation
Joint distribution model
A joint probability distribution P(X1 = x1, X2 = x2, …, Xn = xn) specifies the probability of each complete configuration
In general it takes mⁿ parameters (less one, due to the normalization constraint) to specify an arbitrary joint distribution on n random variables with m values each
This can be captured in a lookup table θx(1), θx(2), …, θx(mⁿ), where θx(k) gives the probability of the random variables jointly taking on the configuration x(k)
So θx(k) = P(X = x(k))
satisfying ∑_{k=1}^{mⁿ} θx(k) = 1
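As a sketch, the table can be represented directly as a dictionary from configurations to parameters; this illustrative table (n = 3 binary variables, so 2³ = 8 entries) is reused by the task sketches below:

```python
from itertools import product

# Lookup-table representation of a joint distribution over n = 3
# binary variables (m = 2 values each): m**n = 8 parameters.
theta = {
    ("Y", "Y", "Y"): 0.20, ("Y", "Y", "N"): 0.05,
    ("Y", "N", "Y"): 0.10, ("Y", "N", "N"): 0.00,
    ("N", "Y", "Y"): 0.15, ("N", "Y", "N"): 0.10,
    ("N", "N", "Y"): 0.05, ("N", "N", "N"): 0.35,
}

# The parameters must cover every configuration and sum to 1.
assert set(theta) == set(product("YN", repeat=3))
assert abs(sum(theta.values()) - 1.0) < 1e-9
```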
More on computational tasks
Simulation: given the lookup-table representation, compute the cumulative values of θx(k) over the configurations, then select the x(k) whose cumulative probability interval contains a given value p (drawn uniformly from [0, 1])
Evaluation: evaluate the probability of a complete configuration x = (x1, x2, …, xn) directly from the lookup table:
P(X1 = x1, …, Xn = xn) = θ(x1 x2 … xn)
Marginalization: the probability of an incomplete configuration (x1, …, xk):
P(X1 = x1, …, Xk = xk) = ∑_{yk+1} … ∑_{yn} P(X1 = x1, …, Xk = xk, Xk+1 = yk+1, …, Xn = yn)
From the lookup table: = ∑_{yk+1} … ∑_{yn} θ(x1 x2 … xk, yk+1 … yn)
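A sketch of these three tasks against the illustrative theta table above:

```python
import random

def simulate(theta):
    # Inverse-CDF sampling: walk the cumulative sum until it covers p.
    p = random.random()
    cumulative = 0.0
    for config, prob in theta.items():
        cumulative += prob
        if p < cumulative:
            return config
    return config  # guard against floating-point round-off

def evaluate(theta, config):
    # P(X1 = x1, ..., Xn = xn): a single table lookup.
    return theta[config]

def marginalize(theta, partial):
    # partial maps variable index -> fixed value; sum over all the rest.
    return sum(prob for config, prob in theta.items()
               if all(config[i] == v for i, v in partial.items()))

print(evaluate(theta, ("Y", "Y", "Y")))  # 0.20
print(marginalize(theta, {0: "Y"}))      # P(X1 = Y) = 0.35
```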
Completion: compute the conditional probability of a possible completion (yk+1, yk+2, …, yn) given an incomplete configuration x = (x1, x2, …, xk)
We need to evaluate the completed configuration and then divide by a marginal sum:
P(yk+1, …, yn | x1, …, xk) = θ(x1 x2 … xk yk+1 … yn) / ∑_{zk+1} … ∑_{zn} θ(x1 x2 … xk, zk+1 … zn)
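Continuing the same sketch, conditioning and completion reuse evaluate and marginalize from above:

```python
def condition(theta, partial, completion):
    # P(completion | partial) = theta[full configuration] / marginal of partial.
    full = {**partial, **completion}
    n = len(next(iter(theta)))
    config = tuple(full[i] for i in range(n))
    return evaluate(theta, config) / marginalize(theta, partial)

def complete(theta, partial):
    # Most probable completion: argmax over configurations matching partial.
    candidates = [c for c in theta
                  if all(c[i] == v for i, v in partial.items())]
    return max(candidates, key=lambda c: theta[c])

print(condition(theta, {0: "Y"}, {1: "Y", 2: "Y"}))  # 0.20 / 0.35 ~ 0.571
print(complete(theta, {0: "N"}))                     # ('N', 'N', 'N')
```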
Example
Spam detection: decide whether an arbitrary e-mail message is spam or not
Caps = 'Y' if the message subject line contains no lowercase letters, 'N' otherwise
Free = 'Y' if the word 'free' appears in the message subject line (letter case is ignored), 'N' otherwise
Spam = 'Y' if the message is spam, 'N' otherwise
Randomly select 100 messages and count how many times each configuration appears; a counting sketch follows
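A sketch of the learning step: estimate each parameter by relative frequency. The counts are illustrative, chosen to agree with the two lookups quoted on the next slide:

```python
from collections import Counter

# Illustrative counts of (Free, Caps, Spam) over 100 sampled messages.
counts = Counter({
    ("Y", "Y", "Y"): 20, ("Y", "Y", "N"): 1,  ("Y", "N", "Y"): 9,
    ("Y", "N", "N"): 0,  ("N", "Y", "Y"): 6,  ("N", "Y", "N"): 4,
    ("N", "N", "Y"): 10, ("N", "N", "N"): 50,
})

total = sum(counts.values())  # 100
theta_hat = {config: n / total for config, n in counts.items()}

print(theta_hat[("Y", "Y", "Y")])  # P(Free=Y, Caps=Y, Spam=Y) = 0.2
print(theta_hat[("Y", "N", "N")])  # P(Free=Y, Caps=N, Spam=N) = 0.0
```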
Given a fully specified joint distribution table, one can look up the probability of any configuration. For example:
P(Free = Y, Caps = Y, Spam = Y) = 0.2
P(Free = Y, Caps = N, Spam = N) = 0.0
Drawbacks of the joint distribution model:
memory cost to store the table (mⁿ entries)
running-time cost to do the summations
the sparse data problem in learning
Generative Model
Idea for the traditional generative model: what does the automaton below generate?
[automaton figure: states connected by transitions labeled with the words I, he, know, knows, that, sky, is, blue]
It generates strings such as: I know that sky is blue; I know that he knows that sky is blue; I know that I know that sky is blue; …
But not: sky is blue; I know he; I blue that; …
This set of strings is the language of the automaton
Idea for a probabilistic generative model:
[figure: an automaton state Qi with stop probability P(STOP | Qi) = 0.2, emitting each term with the probability shown below] (Manning, Raghavan & Schütze, 2009)
string : assigned probability
the : 0.2
a : 0.1
frog : 0.01
toad : 0.01
said : 0.03
likes : 0.02
that : 0.04
… : …
If instead each node has a probability distribution over generating different terms, we have a language model
A language model is a function that puts a probability measure over strings drawn from some vocabulary:
∑_{s ∈ Σ*} P(s) = 1
Each P(ti) above is a term emission probability in this unigram model
Such a model places a probability distribution over any sequence of words
By construction, it also provides a model for generating text according to its distribution; a sketch follows
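A minimal sketch of this unigram model in Python, using the emission probabilities from the figure above and a continue/stop probability of 0.8/0.2:

```python
# Unigram language model M1 with the emission probabilities shown above.
M1 = {"the": 0.2, "a": 0.1, "frog": 0.01, "toad": 0.01,
      "said": 0.03, "likes": 0.02, "that": 0.04}
P_STOP = 0.2  # P(STOP | q); P(continue) = 0.8

def string_probability(model, terms, use_stop=True):
    p = 1.0
    for t in terms:
        p *= model[t]  # term emission probabilities
    if use_stop:
        # Continue before each term after the first, then stop.
        p *= (1 - P_STOP) ** (len(terms) - 1) * P_STOP
    return p

s = "frog said that toad likes frog".split()
print(string_probability(M1, s))  # ~1.573e-12, as computed on the next slide
```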
P(frog said that toad likes frog) = (0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01) {emission probabilities}
× (0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.2) {continue/stop probabilities}
= 0.000000000001573
Usually the continue/stop probabilities are omitted when comparing models
The model that assigns the observed text the higher probability is the more likely model
Compare this model (M2) to the previous model (M1):
string : assigned probability
the : 0.15
a : 0.12
frog : 0.0002
toad : 0.0001
said : 0.03
likes : 0.04
that : 0.04
… : …
Scoring a test string s under both models {omitting P(stop)}:
P(s | M1) = 0.00000000000048
P(s | M2) = 0.000000000000000384
So model 1 is more likely
Types of Generative Models
In general the probability of a sequence of events factors over the earlier events in the sequence, by repeated application of the definition of conditional probability (the chain rule):
P(t1 t2 t3 t4) = P(t1) · P(t2 | t1) · P(t3 | t1 t2) · P(t4 | t1 t2 t3)
If total independence among the events exists:
Puni(t1 t2 t3 t4) = P(t1) · P(t2) · P(t3) · P(t4)
This is the unigram model
If the conditioning is only on the previous term:
Pbi(t1 t2 t3 t4) = P(t1) · P(t2 | t1) · P(t3 | t2) · P(t4 | t3)
This is the bigram model
Unigram models are frequently used when sentence structure is not important
E.g. in IR, but not in speech recognition; a sketch contrasting the two follows
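A sketch contrasting the two factorizations; the toy probabilities are illustrative:

```python
# Unigram vs. bigram scoring of a term sequence (illustrative numbers).
P_uni = {"sky": 0.05, "is": 0.1, "blue": 0.02}
P_bi = {("<s>", "sky"): 0.01, ("sky", "is"): 0.5, ("is", "blue"): 0.2}

def unigram_score(terms):
    p = 1.0
    for t in terms:
        p *= P_uni[t]      # P(t): no context
    return p

def bigram_score(terms):
    p, prev = 1.0, "<s>"   # <s> is a sentence-start symbol
    for t in terms:
        p *= P_bi[(prev, t)]  # P(t | previous term)
        prev = t
    return p

print(unigram_score(["sky", "is", "blue"]))  # 1e-4
print(bigram_score(["sky", "is", "blue"]))   # 0.001
```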
Unigram models are of the 'bag of words' type
This recalls a multinomial distribution of probabilities over words:
P(d) = [ Ld! / ( tf_{t1,d}! · tf_{t2,d}! · … · tf_{tM,d}! ) ] · P(t1)^{tf_{t1,d}} · P(t2)^{tf_{t2,d}} · … · P(tM)^{tf_{tM,d}}
where Ld is the length of document d, M is the vocabulary size, and tf_{t,d} is the frequency of term t in document d
Observe that the positions of the terms are insignificant
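A sketch of this document probability, with term probabilities borrowed from model M1 above:

```python
import math
from collections import Counter

def multinomial_doc_probability(doc_terms, term_probs):
    # P(d) = Ld! / (prod_t tf_{t,d}!) * prod_t P(t)^{tf_{t,d}}
    tf = Counter(doc_terms)
    coeff = math.factorial(len(doc_terms))
    for count in tf.values():
        coeff //= math.factorial(count)  # exact: multinomial coefficients are integers
    prob = float(coeff)
    for term, count in tf.items():
        prob *= term_probs[term] ** count
    return prob

M1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02}
d = "frog said that toad likes frog".split()
print(multinomial_doc_probability(d, M1))  # 360 * 2.4e-11 = 8.64e-9
```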
Fundamental question: which model should we use?
Speech recognition: the model has to generalize beyond the observed data, to allow unknown sequences
IR: a document is finite and mostly fixed, so:
Get a representative sample
Build a language model for each document
Calculate the generative probabilities of query sequences from each model
Rank documents by the probability ranking principle (a ranking sketch follows this list)
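A sketch of this ranking recipe with unigram document models; the documents and the add-alpha smoothing (needed so unseen query terms do not zero out a score) are illustrative:

```python
from collections import Counter

def language_model(doc_terms, alpha=0.1):
    # Unigram model with add-alpha smoothing over the document's vocabulary.
    tf = Counter(doc_terms)
    denom = len(doc_terms) + alpha * (len(tf) + 1)
    return lambda t: (tf[t] + alpha) / denom

def query_likelihood(model, query_terms):
    p = 1.0
    for t in query_terms:
        p *= model(t)
    return p

docs = {
    "d1": "the frog said that the toad likes the frog".split(),
    "d2": "the toad said that the frog likes the toad".split(),
}
query = "frog likes frog".split()
ranked = sorted(docs, reverse=True,
                key=lambda d: query_likelihood(language_model(docs[d]), query))
print(ranked)  # d1 ranks above d2 for this query
```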
Probabilistic Approaches
Probability Ranking Principle
Rank documents by their estimated probability of relevance P(R = 1 | d, q), for document d and query q
Basic case: 1/0 loss; rank the documents and return the top k
Non-restrictive case: the Bayes optimal decision rule: d is relevant iff P(R = 1 | d, q) > P(R = 0 | d, q)
If retrieval costs are involved, d is the next document to be retrieved if, for all documents d′ not yet retrieved,
C0 · P(R = 0 | d) − C1 · P(R = 1 | d) ≤ C0 · P(R = 0 | d′) − C1 · P(R = 1 | d′)
where
C1 = cost of missing a relevant document
C0 = cost of returning a nonrelevant document
Types of Other Generative Models
Rather than building a document model and checking the likelihood of it generating the query,
build a query model and check the likelihood of it generating a document
OR: use both approaches together
This needs a measure of divergence between the document and query models
Kullback-Leibler divergence:
R(d; q) = ∑_{t ∈ V} P(t | Mq) · log [ P(t | Mq) / P(t | Md) ]
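A sketch of the KL score over a tiny illustrative vocabulary; in practice the document model must be smoothed so that no P(t | Md) is zero:

```python
import math

def kl_score(query_model, doc_model):
    # R(d; q) = sum over t of P(t|Mq) * log(P(t|Mq) / P(t|Md)).
    score = 0.0
    for t, p_q in query_model.items():
        if p_q > 0.0:
            score += p_q * math.log(p_q / doc_model[t])
    return score

Mq = {"frog": 0.5, "likes": 0.3, "toad": 0.2}
Md = {"frog": 0.4, "likes": 0.2, "toad": 0.4}
print(kl_score(Mq, Md))  # ~0.095; lower divergence = better-matching document
```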
A translation model generates query words that are not in a document by translating document terms into alternate terms with similar meaning
It needs to know the conditional probability distribution between vocabulary terms:
P(q | Md) = ∏_{t ∈ q} ∑_{v ∈ V} P(v | Md) · T(t | v)
where P(q | Md) is the query translation model, P(v | Md) is the document language model, and T(t | v) is the conditional probability distribution between vocabulary terms
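A sketch of the translation-model likelihood; the document model and translation table T are illustrative:

```python
def translation_likelihood(query_terms, doc_model, T):
    # P(q|Md) = product over t in q of: sum over v of P(v|Md) * T(t|v).
    p = 1.0
    for t in query_terms:
        p *= sum(p_v * T.get((t, v), 0.0) for v, p_v in doc_model.items())
    return p

Md = {"frog": 0.6, "amphibian": 0.4}  # document language model P(v|Md)
T = {("toad", "frog"): 0.3, ("toad", "amphibian"): 0.5,
     ("frog", "frog"): 0.7, ("frog", "amphibian"): 0.4}  # T(t|v)
print(translation_likelihood(["toad"], Md, T))  # 0.6*0.3 + 0.4*0.5 = 0.38
```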
Sources:
CSCI 6509 Notes, Fall 2009, Faculty of Computer Science, Dalhousie University, http://www.cs.dal.ca/~vlado/csci6509/coursecalendar.html
Manning, Raghavan & Schütze, 2009, An Introduction to Information Retrieval
Jurafsky & Martin, 2000, An Introduction to NLP, Computational Linguistics and Speech Recognition
Ghahramani, 2000, Fundamentals of Probability