Basic probability theory and n-gram models
INF4820 – H2010
Jan Tore Lønning
Institutt for Informatikk
Universitetet i Oslo
28 September
Outline
1. Basic probability
2. Bayes theorem
3. Language modeling
Gambling — the beginning of probabilities
Example
Throwing a fair die
What is the chance of getting a 5?
What is the chance of getting an odd number?
What is the chance of getting a number ≥ 3?
Example
Two throws
What is the chance of getting two 5s?
What is the chance of getting the sum 10?
What is the chance that the sum is 6 or more if the first die shows 1 or 2?
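
These questions can be answered by brute-force enumeration of the equally likely outcomes. A minimal Python sketch (mine, not part of the original slides) for the two-throw questions:

    from itertools import product
    from fractions import Fraction

    # All 36 equally likely outcomes of two throws of a fair die.
    outcomes = list(product(range(1, 7), repeat=2))

    def prob(event):
        # Probability = number of favourable outcomes / number of outcomes.
        return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

    print(prob(lambda o: o == (5, 5)))                   # two 5s: 1/36
    print(prob(lambda o: sum(o) == 10))                  # sum 10: 1/12
    # "sum 6 or more if the first die shows 1 or 2": restrict the sample space.
    restricted = [o for o in outcomes if o[0] in (1, 2)]
    favourable = [o for o in restricted if sum(o) >= 6]
    print(Fraction(len(favourable), len(restricted)))    # 5/12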
Some terminology
The sample space is a set Ω of elementary outcomes (No:
“utfallsrom”).
For example, let Ω represent the outcome of throwing a die:
Ω = {one, two, three, four, five, six}
Events (A, B, C, . . .) are subsets of this set (No:
“hendelse, begivenhet”),
such as {one}, {two, four, six}, {three, six}, etc.
A probability measure P is a function from events to the
interval [0, 1].
P(A) is the probability of event A.
With a fair die: P({two}) = 1/6, P({two, four, six}) = 1/2.
Some Basic Rules
The three axioms of probability theory...
P(A) ≥ 0 for all events A (non-negativity)
P(Ω) = 1 (unit measure)
A ∩ B = ∅ ⇒ P(A ∪ B) = P(A) + P(B) (additivity for disjoint events)
And some of their consequences
P(Ā) = 1 − P(A)
P(∅) = 0
If A ⊆ B then P(A) ≤ P(B)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) (the addition rule)
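
As a quick numerical check (not in the slides), the complement rule and the addition rule can be verified for single-die events; the event names below are ad hoc:

    from fractions import Fraction

    omega = {1, 2, 3, 4, 5, 6}                  # sample space for one fair die
    P = lambda A: Fraction(len(A), len(omega))  # uniform probability measure

    odd = {1, 3, 5}
    at_least_4 = {4, 5, 6}

    # P(Ā) = 1 − P(A)
    assert P(omega - odd) == 1 - P(odd)
    # P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
    assert P(odd | at_least_4) == P(odd) + P(at_least_4) - P(odd & at_least_4)
    print("complement rule and addition rule hold for these events")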
Joint Probability and Independence
Joint probability
The joint probability of two events A and B is the probability
of both events occurring, P(A ∩ B).
Example
(a) Throwing two dice: A is the event that the first shows 3, B is
the event that the second shows 4 or 6.
(b) Throwing two dice: A is the event that the first is odd, B that the sum of the two is even.
(c) Throwing one die: A is the event that it is odd, B that it is ≥ 4.
Independence
Two events A and B are independent if
P(A ∩ B) = P(A)P(B).
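
A small sketch (mine, not the slides') that checks the definition P(A ∩ B) = P(A)P(B) for the three example pairs, again by enumeration:

    from itertools import product
    from fractions import Fraction

    two_dice = list(product(range(1, 7), repeat=2))
    one_die = [(d,) for d in range(1, 7)]

    def independent(space, A, B):
        # True iff P(A ∩ B) = P(A)P(B) under the uniform measure on `space`.
        P = lambda ev: Fraction(sum(1 for o in space if ev(o)), len(space))
        return P(lambda o: A(o) and B(o)) == P(A) * P(B)

    # (a) first die shows 3 vs. second shows 4 or 6 -> True (independent)
    print(independent(two_dice, lambda o: o[0] == 3, lambda o: o[1] in (4, 6)))
    # (b) first die is odd vs. the sum is even -> True (independent)
    print(independent(two_dice, lambda o: o[0] % 2 == 1, lambda o: sum(o) % 2 == 0))
    # (c) one die: it is odd vs. it is >= 4 -> False (not independent)
    print(independent(one_die, lambda o: o[0] % 2 == 1, lambda o: o[0] >= 4))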
Conditional Probability and Independence
The notion of conditional probability lets us capture the
influence of partial knowledge about the outcome.
The conditional probability of A given B is defined as
P(A | B) = P(A ∩ B) / P(B)
The probability that we will observe A given that we have
already observed B. The fraction of B’s probability mass
that also covers A.
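
Concretely, reusing the two-dice question from earlier (my own sketch, not the slides'): the probability that the sum is 6 or more, given that the first die shows 1 or 2, is the fraction of B's probability mass that also lies in A:

    from itertools import product
    from fractions import Fraction

    outcomes = list(product(range(1, 7), repeat=2))
    P = lambda ev: Fraction(sum(1 for o in outcomes if ev(o)), len(outcomes))

    A = lambda o: sum(o) >= 6       # the sum is 6 or more
    B = lambda o: o[0] in (1, 2)    # the first die shows 1 or 2

    # P(A | B) = P(A ∩ B) / P(B)
    print(P(lambda o: A(o) and B(o)) / P(B))    # 5/12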
Conditional Probability and Independence (cont’d)
The Multiplication Rule
The numerator in our equation for conditional probability
can itself be “conditionalized” using the multiplication rule:
P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)
The Chain Rule
Generalizes the multiplication rule to multiple events:
P(A ∩ B ∩ C ∩ D ∩ . . .) =
P(A)P(B|A)P(C|A ∩ B)P(D|A ∩ B ∩ C) . . .
Two events A and B are independent
(P(A ∩ B) = P(A)P(B)) if and only if P(A|B) = P(A).
Random Variables
Instead of dealing with the sample space directly
(which will be different for each application), random
variables provide an extra layer of abstraction that lets us
deal with events in a more uniform way.
A discrete random variable X is a function from the sample
space Ω into a finite or countably infinite set of values.
The probability mass function (pmf) p for a random variable
X gives us the probability of a given numerical value:
p(x) = p(X = x) = P({ω ∈ Ω : X(ω) = x})
For a discrete random variable X , let xi be the value
corresponding to the event Axi in Ω. We then have that:
∑_{x_i ∈ X} p(X = x_i) = ∑_{A_{x_i} ∈ Ω} P(A_{x_i}) = P(Ω) = 1
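
A sketch (not from the slides) of a discrete random variable: X maps each outcome of two dice to their sum, and the pmf collects the probability mass of the corresponding events:

    from itertools import product
    from fractions import Fraction
    from collections import defaultdict

    omega = list(product(range(1, 7), repeat=2))   # sample space: two fair dice
    X = lambda w: w[0] + w[1]                      # random variable: the sum

    # pmf: p(x) = P({ω in Ω : X(ω) = x})
    pmf = defaultdict(Fraction)
    for w in omega:
        pmf[X(w)] += Fraction(1, len(omega))

    print(pmf[7])               # 1/6, the most likely sum
    print(sum(pmf.values()))    # 1, as the equation above requires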
Bayes' theorem
P(A | B) = P(A ∩ B) / P(B)          (conditional probability)
P(A ∩ B) = P(A | B)P(B)             (the multiplication rule)
P(A ∩ B) = P(B | A)P(A)
P(A | B)P(B) = P(B | A)P(A)
P(A | B) = P(B | A)P(A) / P(B)      (Bayes' theorem)
Bayes' theorem
Example (from Wikipedia)
40% girls, 60% boys.
All the boys wear trousers.
50% of the girls wear trousers.
What is the probability that someone who wears trousers is a girl?
Example (Solution with Bayes)
Probability that a person is a girl: P(A) = 0.4
Probability that a girl wears trousers: P(B | A) = 0.5
Probability of wearing trousers:
P(B) = P(B | A)P(A) + P(B | Ā)P(Ā) = 0.5 × 0.4 + 1 × 0.6 = 0.8
P(A | B) = P(B | A)P(A) / P(B) = (0.5 × 0.4) / 0.8 = 0.25
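
The same calculation as a minimal Python sketch (the variable names are my own):

    from fractions import Fraction

    p_girl = Fraction(4, 10)                 # P(A), the prior
    p_trousers_given_girl = Fraction(1, 2)   # P(B | A)
    p_trousers_given_boy = Fraction(1)       # P(B | Ā): all boys wear trousers

    # Total probability: P(B) = P(B | A)P(A) + P(B | Ā)P(Ā)
    p_trousers = p_trousers_given_girl * p_girl + p_trousers_given_boy * (1 - p_girl)

    # Bayes' theorem: P(A | B) = P(B | A)P(A) / P(B)
    print(p_trousers_given_girl * p_girl / p_trousers)   # 1/4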
A bit of jargon
P(A | B) = (P(B | A) / P(B)) · P(A)
P(A) is the prior probability.
In the example: the probability that the person is a girl before we
know that she wears trousers.
P(A | B) is the posterior probability.
In the example: the probability that the person is a girl after we
have learned that the person wears trousers.
Argmax notation
Definition (Argmax)
x0 = arg max_x f(x)   means that   f(y) ≤ f(x0) for all y
Argmax and Bayes
x0 = arg max_x P(x | B)
   = arg max_x P(B | x)P(x) / P(B)
   = arg max_x P(B | x)P(x)
   = arg max_x [log P(B | x) + log P(x)]
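
A short sketch (mine; the candidates and probabilities are invented) showing that dropping the constant P(B) and moving to log space leave the arg max unchanged:

    import math

    # Hypothetical candidates x with P(B | x) and P(x); P(B) is the same for all x.
    likelihood = {"x1": 0.02, "x2": 0.05, "x3": 0.01}   # P(B | x)
    prior = {"x1": 0.5, "x2": 0.1, "x3": 0.4}           # P(x)
    p_b = 0.019                                         # P(B)

    posterior = {x: likelihood[x] * prior[x] / p_b for x in prior}
    product = {x: likelihood[x] * prior[x] for x in prior}
    log_sum = {x: math.log(likelihood[x]) + math.log(prior[x]) for x in prior}

    # All three scores select the same x0.
    print(max(posterior, key=posterior.get),
          max(product, key=product.get),
          max(log_sum, key=log_sum.get))

The log form is what is used in practice, since products of many small probabilities underflow.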
Some mathematical notation
Sum and product
∑_{i=1}^{n} a_i = a_1 + a_2 + · · · + a_n
∏_{i=1}^{n} a_i = a_1 · a_2 · · · a_n
Example
∑_{i=1}^{7} i = 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28
∏_{i=1}^{7} i = 1 · 2 · 3 · · · 7 = 5040
log(∏_{i=1}^{n} a_i) = ∑_{i=1}^{n} log(a_i)
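
A tiny check of the last identity (not in the slides); it is what allows language models to add log probabilities instead of multiplying many small numbers:

    import math

    a = [0.5, 0.1, 0.25, 0.04]             # some probabilities a_i
    print(math.log(math.prod(a)))          # log(∏ a_i)
    print(sum(math.log(x) for x in a))     # ∑ log(a_i), the same value up to rounding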
Language Modeling
A stochastic (or probabilistic) language model M is a
model that assigns probabilities P_M(x) to all strings x ∈ L.
Remember the properties of P_M:
L is the sample space
0 ≤ P_M(x) ≤ 1
∑_{x∈L} P_M(x) = 1
Used in NLP:
1. Speech recognition
2. Machine translation
3. Spell checking (see Norvig on the net)
4. OCR
5. Tagging
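
A toy illustration (mine, not from the course) of the two properties, with a four-string sample space L:

    from fractions import Fraction

    # A toy language model P_M over a sample space L of four strings.
    P_M = {
        "the cat sleeps": Fraction(1, 2),
        "the dog sleeps": Fraction(1, 4),
        "the cat barks": Fraction(1, 8),
        "the dog barks": Fraction(1, 8),
    }

    assert all(0 <= p <= 1 for p in P_M.values())   # 0 ≤ P_M(x) ≤ 1
    assert sum(P_M.values()) == 1                   # ∑_{x∈L} P_M(x) = 1
    print("P_M is a probability distribution over L")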
Example: MT
We are interested in finding the best English translation Ê
of a Norwegian (foreign) sentence F.
Ê = arg max_E P(E | F)
  = arg max_E P(F | E)P(E) / P(F)
  = arg max_E P(F | E)P(E)
We turn the problem around: we regard F as a translation
(a corruption) of E and ask which E is most likely.
A starting point for simplifications/approximations.
Gives us access to P(E), the language model.
Bigram model
By the chain rule of joint probabilities, the probability of a
sequence of words can be factorized as:
p(w_1, . . . , w_k) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) · · · p(w_k | w_1, . . . , w_{k−1})
However, modeling this full distribution would require too
many parameters...
We simplify the estimation problem by the so-called
Markov assumption: The probability of a given word is
taken to only depend on the preceding word:
p(w_j | w_1, . . . , w_{j−1}) = p(w_j | w_{j−1})
Bigram model
The probability of a string w1 , . . . , wk is just the product of
its individual word probabilities, computed as:
p_n(w_1 . . . w_k) = ∏_{i=1}^{k} p(w_i | w_{i−1})
Special markers for sentence boundaries: <s> and </s>.
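
A sketch of how a bigram model scores a sentence (my own simplification; the probabilities are made up, and in practice they are estimated from a corpus, see the Estimation slide below):

    # Hypothetical bigram probabilities p(w_i | w_{i-1}); real values would be
    # estimated from a training corpus.
    bigram = {
        ("<s>", "i"): 0.6,
        ("i", "like"): 0.3,
        ("like", "bananas"): 0.1,
        ("bananas", "</s>"): 0.4,
    }

    def sentence_prob(words, p=bigram):
        # Product of bigram probabilities over <s> w_1 ... w_k </s>.
        tokens = ["<s>"] + words + ["</s>"]
        prob = 1.0
        for prev, cur in zip(tokens, tokens[1:]):
            prob *= p[(prev, cur)]
        return prob

    print(sentence_prob(["i", "like", "bananas"]))   # 0.6 * 0.3 * 0.1 * 0.4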
N-gram model
In a trigram model the probability of a word may depend on
the two preceding words:
p(w_j | w_1, . . . , w_{j−1}) = p(w_j | w_{j−2} w_{j−1})
p_n(w_1 . . . w_k) = ∏_{i=1}^{k} p(w_i | w_{i−2} w_{i−1})
In an N-gram model the probability of a word may depend
on the N − 1 preceding words:
p(w_j | w_1, . . . , w_{j−1}) = p(w_j | w_{j−N+1} . . . w_{j−1}) = p(w_j | w_{j−N+1}^{j−1})
p_n(w_1 . . . w_k) = ∏_{i=1}^{k} p(w_i | w_{i−N+1}^{i−1})
Estimation
How to get our hands on the probabilities?
P(bananas | i, like) = C(i, like, bananas) / C(i, like, ∗)
The probabilities are given by the relative frequencies of
the observed outcomes, i.e. as they occur in our training
corpus.
= Maximum likelihood estimation / MLE
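
A minimal MLE sketch (mine; the toy corpus is invented) that mirrors the formula above:

    from collections import Counter

    # Toy training corpus, already tokenized and lower-cased.
    corpus = "i like bananas i like apples i like bananas you like apples".split()

    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_mle(w3, w1, w2):
        # P(w3 | w1, w2) = C(w1, w2, w3) / C(w1, w2, *): relative frequency.
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

    print(p_mle("bananas", "i", "like"))   # 2/3: "i like" occurs 3 times, twice followed by "bananas"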
Some Considerations When Counting Words
What counts as a word depends on the application.
Tokenization
Normalization of case, spelling variants, clitics,
abbreviations, numerical expressions, punctuation...
Lemmatization. Base forms vs full word forms.
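
A rough sketch (mine) of what such choices can look like in code; real pipelines are considerably more elaborate:

    import re

    text = "Mr. Smith doesn't like bananas. He bought 2 kg!"

    # Lower-case, split off punctuation, keep word-internal apostrophes.
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[.,!?]", text.lower())
    print(tokens)
    # ['mr', '.', 'smith', "doesn't", 'like', 'bananas', '.', 'he', 'bought', '2', 'kg', '!']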
Evaluation
Extrinsic Evaluation
AKA application-based or end-to-end evaluation
Measure the effect on performance within the embedding
application (e.g. the MT-system or the speech-recognizer
that the LM is a part of).
Intrinsic Evaluation
Quick and cheap evaluation based on the probability that
the model assigns to unseen test data.
Training Data vs Test Data
We use the training set to estimate our models and use the
test set to evaluate them.
Testing on the training data would give unrealistically high
expectations about performance in a real application.
Even with a perfect score when testing on the training data,
we would probably see a drastic fall in performance on
unseen data.
Related: Overfitting. Why is the MLE a poor estimator?
Problems
1. Data sparseness
   Regardless of corpus size, perfectly acceptable phrases will be missing.
   The creativity of language.
2. Unknown words
3. Zero counts (mix badly with the factorization into word probabilities)
Some remedies
1. Make sure all n-grams receive a non-zero count (smoothing).
2. Make provisions for out-of-vocabulary words (OOVs).
Smoothing
General idea: take some of the probability mass of
frequent events, and redistribute it to less frequent or
unseen events.
Simplest approach: Add-One smoothing:
P̂(w_i | w_{i−n+1} . . . w_{i−1}) = (C(w_{i−n+1} . . . w_{i−1} w_i) + 1) / (C(w_{i−n+1} . . . w_{i−1}) + N)
where N is the number of words in the vocabulary
Other LM techniques aimed at overcoming problems with
unseen events:
Back-off and interpolated models
Class-based models
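
A sketch of add-one smoothing for bigrams (my own; the corpus and counts are invented), following the formula above with N taken to be the vocabulary size:

    from collections import Counter

    corpus = "<s> i like bananas </s> <s> i like apples </s>".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    N = len(set(corpus))                    # vocabulary size

    def p_add_one(w, prev):
        # P̂(w | prev) = (C(prev, w) + 1) / (C(prev) + N)
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + N)

    print(p_add_one("bananas", "like"))     # seen bigram: (1 + 1) / (2 + 6) = 0.25
    print(p_add_one("apples", "bananas"))   # unseen bigram still gets mass: 1/7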