Ranking-I
Probability Basics and Language Models
Temporal Information Retrieval
Avishek Anand
Probability Basics: Independence and Conditionals
• Event space E: all possible events
  • Coin: {T, H}; document: {w1, …, w|V|}; dice: {1, 2, 3, 4, 5, 6}
• Total probability principle: the probabilities of all events sum to 1
  \sum_{x_i \in E} P(x_i) = 1
  [Figure: Venn diagram of the event space with two events A and B]
• Independence: A and B are independent if and only if
  P(A, B) = P(A) · P(B)
  • Independence means the events are "unrelated", e.g. consecutive coin tosses
• Dependence: A and B influence each other
  • P(A | B) is read as "the probability of A given that B is true" (B is given)
Probability Basics: Conditional Independence
• A = "X studies CS", B = "X knows programming", C = "X plays cricket"
  • A and B are dependent; B and C are independent:
    P(B, C) = P(B) · P(C)        P(B | C) = P(B)
• Joint distribution in terms of conditionals:
  P(A, B) = P(A | B) · P(B) = P(B | A) · P(A)
• Chain rule of probability:
  P(A, B, C) = P(A | B, C) · P(B, C) = P(A | B, C) · P(B | C) · P(C)
• A is conditionally independent of C given B:
  P(A | B, C) = P(A | B)
  and therefore P(A, B, C) = P(A | B) · P(B) · P(C)
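To see why this factorization helps, here is a toy calculation with made-up numbers (not from the slides): suppose P(C) = 0.5, P(B) = 0.6 and P(A | B) = 0.9. Then
  P(A, B, C) = P(A | B) · P(B) · P(C) = 0.9 × 0.6 × 0.5 = 0.27,
so the three-way joint probability is obtained from three small factors instead of a full joint table over A, B and C.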
Probability Basics: Bayes' Theorem
• P(A, B) = P(A | B) · P(B) = P(B | A) · P(A)
• Bayes' theorem:
  P(A | B) = P(B | A) · P(A) / P(B)
  where P(A | B) is the posterior probability and P(A) is the prior probability
• This is the basis for all Bayesian inference
• What is P("holmes" | "sherlock"), given that P("sherlock", "holmes") and P("sherlock") are known?
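One way to answer the closing question, using only the definition of conditional probability (this step is not spelled out on the slide):
  P("holmes" | "sherlock") = P("sherlock", "holmes") / P("sherlock")
i.e. the known joint probability divided by the known marginal probability.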
Parameter Estimation
• Given observed data D, assume a probability density function p(x) with parameters θ that generates it
• Parameter estimation problem: which values of the parameters best explain the data?
• Example: say you believe D is generated from a Normal distribution with parameters θ = (μ, σ)
  [Figure: histogram of the observed data]
Maximum Likelihood Estimator (MLE)
• Say you believe D is generated from a Normal distribution with parameters θ = (μ, σ)
  [Figure: histogram of the observed data]
• The likelihood of observing this independent and identically distributed sample is
  L(\theta \mid x = (x_1, x_2, \ldots, x_n)) = p(x \mid \theta)
  where θ are the parameters to be estimated and x is the observed data
• MLE chooses the \hat{\theta} that maximizes this likelihood of generating the observed data
Maximum Likelihood Estimator (MLE)
• The desired probability distribution is the one that makes the observed data "most likely": one must seek the value of the parameter vector \hat{\theta} that maximizes the likelihood function L(\theta \mid x)
  p(x = (x_1, x_2, \ldots, x_n) \mid \theta) = p(x_1 \mid \theta) \cdot p(x_2 \mid \theta) \cdots p(x_n \mid \theta)
• For computational convenience, the MLE is obtained by maximizing the log-likelihood function
  \ln L(\theta \mid x) = \sum_{x_i} \ln p(x_i \mid \theta)
• If the log-likelihood is differentiable, find \hat{\theta} via its partial derivatives
Maximum Likelihood Estimator (MLE)
• Likelihood and log-likelihood:
  L(\theta \mid x = (x_1, x_2, \ldots, x_n)) = \prod_{x_i} p(x_i \mid \theta)          (likelihood)
  \ln L(\theta \mid x) = \sum_{x_i} \ln p(x_i \mid \theta)                              (log-likelihood)
• Find the maxima using partial derivatives, i.e.
  \frac{\partial \ln L(\theta \mid x)}{\partial \theta_i} = 0
• Example: Gaussian generative model
  p(x_i \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}
  \frac{\partial \ln L(\theta \mid x)}{\partial \mu} = 0,        \frac{\partial \ln L(\theta \mid x)}{\partial \sigma} = 0
• What are the estimated parameters of the Gaussian?
  [Figure: histogram of the observed data]
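For reference, solving these two equations gives the standard closed-form Gaussian MLE (a well-known result, stated here but not derived on the slide):
  \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i,        \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2
i.e. the sample mean and the (biased) sample variance of the observed data.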
Find a Distribution for Coin Tosses
• A coin with E = {H, T} has probabilities p(H) = p and p(T) = 1 − p
• In 3 tosses, what is the probability of getting the following sequences?
  • H, T, T:  p(1 − p)(1 − p)
  • T, H, T:  p(1 − p)(1 − p)
• What is the probability of getting exactly 1 head?
  3 · p(1 − p)(1 − p)
• What is the probability of getting k heads in n throws? The binomial distribution:
  \binom{n}{k} p^k (1 - p)^{n-k}
  where \binom{n}{k} counts the ways in which the k heads can be chosen, and p^k (1-p)^{n-k} is the probability of any one sequence with k heads
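A minimal sketch of this computation in Python (not from the slides; the function name and the values n = 3, k = 1, p = 0.5 are only illustrative):

```python
from math import comb

def prob_k_heads(n: int, k: int, p: float) -> float:
    """Binomial probability of exactly k heads in n independent tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exactly 1 head in 3 tosses of a fair coin: 3 * 0.5 * 0.5 * 0.5 = 0.375
print(prob_k_heads(n=3, k=1, p=0.5))
```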
Multinomial Distribution
• Let's try to find a distribution for text, say a document
• You are given a document D with term frequencies
  • tf("game") = 5, tf("of") = 15, tf("thrones") = 5, tf("arya") = 3, tf("stark") = 4, …
  • E = {all words in D} = {"game", "thrones", "stark", …}
• The multinomial distribution best encodes the generative process of text:
  \binom{n}{tf_1, tf_2, \ldots, tf_{|E|}} \; p(x_1)^{tf_1} \cdots p(x_{|E|})^{tf_{|E|}}
  where tf_i is the term frequency of word x_i and p(x_i) is the probability of the word occurring
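As a sketch of what this formula computes (hypothetical code, not from the slides; the word probabilities below are made up), the multinomial log-probability of a document's term-frequency vector can be evaluated as:

```python
from math import lgamma, log

def multinomial_log_prob(tf, p):
    """log of  (n choose tf_1,...,tf_|E|) * prod_i p(x_i)^tf_i  for a document."""
    n = sum(tf.values())
    coeff = lgamma(n + 1) - sum(lgamma(f + 1) for f in tf.values())  # log multinomial coefficient
    return coeff + sum(f * log(p[w]) for w, f in tf.items())

# Counts from the slide's example; the probabilities are illustrative only
tf = {"game": 5, "of": 15, "thrones": 5, "arya": 3, "stark": 4}
p = {"game": 0.15, "of": 0.45, "thrones": 0.15, "arya": 0.1, "stark": 0.15}
print(multinomial_log_prob(tf, p))
```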
Language Model for a Document
• What are the parameters to be estimated, and how can we estimate them?
  \binom{|D|}{tf_1, tf_2, \ldots, tf_{|V|}} \; p(x_1)^{tf_1} \cdots p(x_{|V|})^{tf_{|V|}}
  where |D| is the total number of words in the document, the term frequencies tf_i are the observed values, and the word probabilities p(x_i) are the parameters to be estimated
• The MLE for each word probability p(x_i) is the most natural estimate:
  p(x_i) = \frac{tf_i}{|D|}
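A minimal sketch of this estimate in Python (hypothetical code, not from the slides): the MLE unigram model is simply the relative frequency of each word in the document.

```python
from collections import Counter

def mle_unigram_lm(document: list[str]) -> dict[str, float]:
    """MLE unigram language model: p(w | D) = tf(w; D) / |D|."""
    counts = Counter(document)
    total = len(document)
    return {word: tf / total for word, tf in counts.items()}

# Toy document built from the slide's example vocabulary (counts are made up)
doc = "game of thrones game of arya stark of".split()
lm = mle_unigram_lm(doc)
print(lm["of"])    # 3/8 = 0.375
print(lm["game"])  # 2/8 = 0.25
```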
Statistical Language Models
• Models that describe language generation
• Traditional NLP applications assign a probability value to a sentence:
  • Machine Translation: P(high snowfall) > P(large snowfall)
  • Spelling Correction: P(in the vineyard) > P(in the vinyard)
  • Speech Recognition: P(I saw a van) >> P(eyes awe of an)
  • Question Answering
• Goal: compute the probability of a sentence or sequence of words:
  P(S) = P(w1, w2, w3, w4, w5, …, wn)
• A model that computes P(S) is called a language model
Statistical Language Models
• How do we compute P(S)?
  P(S) = P(w1, w2, w3, w4, w5, …)
• Chain rule of probability:
  P(w1, w2, w3, w4, w5, …) = P(w1 | w2, w3, …) · P(w2 | w3, …) · …
• Unigram Model (small model):
  P(w1 | w2, w3, …) = P(w1)
  P(w1, w2, w3, w4, w5, …) = P(w1) · P(w2) · …
• Bi-gram Model:
  P(w1, w2, w3, w4, w5, …) = P(w1 | w2) · P(w2 | w3) · …
• n-gram models store n-gram dependencies (big model)
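A toy sketch of the two factorizations (hypothetical code; the probability tables are made up, and the bigram version follows the slide's shorthand of conditioning each word on the word that follows it):

```python
def sentence_prob_unigram(words, p_uni):
    """Unigram model: P(S) = prod_i P(w_i)."""
    prob = 1.0
    for w in words:
        prob *= p_uni[w]
    return prob

def sentence_prob_bigram(words, p_bi, p_uni):
    """Bigram model, per the slide's shorthand P(w1,...,wn) = P(w1|w2) . P(w2|w3) ... :
    each word is conditioned on the word that follows it."""
    prob = p_uni[words[-1]]              # the last word has nothing following it
    for w, nxt in zip(words, words[1:]):
        prob *= p_bi.get((w, nxt), 0.0)  # P(w | following word)
    return prob

# Made-up toy probability tables (illustrative only)
p_uni = {"i": 0.2, "saw": 0.1, "a": 0.3, "van": 0.05}
p_bi = {("i", "saw"): 0.4, ("saw", "a"): 0.5, ("a", "van"): 0.2}
print(sentence_prob_unigram(["i", "saw", "a", "van"], p_uni))
print(sentence_prob_bigram(["i", "saw", "a", "van"], p_bi, p_uni))
```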
Language Models in IR
• The Unigram Model is typically used in IR:
  P(w1, w2, w3, w4, w5, …) = P(w1) · P(w2) · …
• Build a language model for each document D: { P(wi | D) }
• Ranking:
  • Query Likelihood: given a query Q, find the relevance of D (rank according to P(D | Q))
  • KL-Divergence Model: build a language model for the query, P(w | Q), and rank according to KL(D || Q), which compares the two distributions
Ranking using Query Likelihood
• Query Likelihood: given a query Q, find the relevance of D (rank according to P(D | Q))
  P(D \mid Q) = \frac{P(Q \mid D) \cdot P(D)}{P(Q)}
  where P(Q) is constant for all documents, and P(D) captures the authority of the document (determined by PageRank etc.)
  P(D \mid Q) \propto P(Q \mid D) \cdot P(D)
• Assuming all documents are equally likely:
  P(D \mid Q) \propto P(Q \mid D)
  P(D \mid Q) \propto \prod_{w_i \in Q} P(w_i \mid D)        (unigram language model)
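A minimal sketch of this ranking rule (hypothetical code, not from the slides), using unigram document models such as the MLE ones sketched earlier and assuming all documents are equally likely:

```python
def query_likelihood(query: list[str], lm: dict[str, float]) -> float:
    """P(Q | D) = prod_{w in Q} P(w | D) under a unigram model (no smoothing)."""
    score = 1.0
    for w in query:
        score *= lm.get(w, 0.0)  # unseen query terms give probability 0
    return score

def rank(query: list[str], doc_lms: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Rank documents by P(Q | D), highest first."""
    scores = {doc_id: query_likelihood(query, lm) for doc_id, lm in doc_lms.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```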
Language Model for a Document
• The Unigram Language Model provides a probabilistic model for representing text in Information Retrieval
  [Figure: two documents D1 and D2 shown as bags of pictured terms, each with its unigram term probabilities (1/8, 1/4, 3/8, …)]
• For the pictured query: p(query | D1) = 3/32 and p(query | D2) = 1/32, so D1 is ranked above D2
Example: Language Model + MLE
• The Unigram Language Model provides a probabilistic model for representing text
• The MLE for each word probability p(x_i) is the most natural estimate:
  p(x_i) = \frac{tf_i}{|D|}
Zero Probability Problem
• What if some of the queried terms are absent from the document?
• MLE-based estimation then yields a zero probability for query generation:
  • P("frog", "ape" | M1) = 0.01 × 0
  • P("frog", "ape" | M2) = 0.00021 × 0
• We need to smooth the probability estimates for terms to avoid zero probabilities
  • Take some probability mass from each seen term and redistribute it among the missing terms
Smoothing Methods
• Jelinek-Mercer Smoothing: a linear combination of document and corpus statistics to estimate term probabilities
  P(Q \mid D) = \prod_{w_i \in Q} \big[ \lambda \cdot P(w_i \mid D) + (1 - \lambda) \cdot P(w_i \mid C) \big]
  where \lambda P(w_i \mid D) is the document contribution, (1 - \lambda) P(w_i \mid C) is the corpus contribution (collection frequency or document frequency), and the parameter \lambda regulates the contribution of each
• Collection frequency: fraction of occurrences of the term in the entire collection
• Document frequency: fraction of documents in the entire collection that contain the term
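A minimal sketch of Jelinek-Mercer scoring in Python (hypothetical code; the choice lam = 0.5 is only illustrative):

```python
def jm_score(query, doc_lm, corpus_lm, lam=0.5):
    """Jelinek-Mercer smoothed query likelihood:
    P(Q|D) = prod_w [ lam * P(w|D) + (1 - lam) * P(w|C) ]."""
    score = 1.0
    for w in query:
        p_doc = doc_lm.get(w, 0.0)        # P(w | D), zero if w is absent from the document
        p_corpus = corpus_lm.get(w, 0.0)  # P(w | C), collection statistics
        score *= lam * p_doc + (1 - lam) * p_corpus
    return score
```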
Smoothing Methods
• Smoothing with a Dirichlet Prior:
  P(Q \mid D) = \prod_{w_i \in Q} \frac{tf(w_i; D) + \mu \cdot P(w_i \mid C)}{|D| + \mu}
  where tf(w_i; D) is the term frequency of the word in the document, P(w_i \mid C) is the collection frequency or document frequency, and \mu is the Dirichlet prior parameter
• Takes the corpus distribution as a prior when estimating the term probabilities
• Works well for short queries
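And a matching sketch for Dirichlet-prior smoothing (hypothetical code; mu = 2000 is shown only as an illustrative default, not a value given on the slide):

```python
def dirichlet_score(query, doc_tf, doc_len, corpus_lm, mu=2000.0):
    """Dirichlet-prior smoothed query likelihood:
    P(Q|D) = prod_w ( tf(w; D) + mu * P(w|C) ) / ( |D| + mu )."""
    score = 1.0
    for w in query:
        tf = doc_tf.get(w, 0)             # tf(w; D)
        p_corpus = corpus_lm.get(w, 0.0)  # P(w | C)
        score *= (tf + mu * p_corpus) / (doc_len + mu)
    return score
```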
Temporal Ranking
• Queries contain temporal expressions
  • "fifa world cup in 1998"
• Documents also mention temporal expressions
  • "Clinton was the president in 1990s"
  • "Interstellar was slated for release in july 2014"
  P(Q \mid D) = P(Q_{text} \mid D_{text}) \cdot P(Q_{time} \mid D_{time})
• Assume time mentions are independent of the text
• Assume temporal expressions are independent of each other
• How do we compute P(Q_{time} \mid D_{time})?
References and Further Readings
• Information Retrieval (http://www.ir.uwaterloo.ca/book/): Stefan Büttcher (Google Inc.), Charles L. A. Clarke (Univ. of Waterloo), Gordon V. Cormack (Univ. of Waterloo)
• Foundations of Information Retrieval: Manning, Schütze, Raghavan