
Probabilistic and Lexicalized Parsing
Handling Ambiguities
• The Earley algorithm is equipped to represent ambiguities efficiently but not to resolve them.
• Methods available for resolving ambiguities include:
– Semantics (choose the parse that makes sense).
– Statistics (choose the parse that is most likely).
• Probabilistic context-free grammars (PCFGs) offer a solution.
Issues for PCFGs
• The probabilistic model
• Getting the probabilities
• Parsing with probabilities
– Task: find the maximum-probability tree for an input string
PCFG
• A PCFG is a 5-tuple (NT, T, P, S, D) where D is a function that assigns a probability to each rule p ∈ P
• A PCFG augments each rule with a conditional probability:
A → β [p]
• Formally, this is the probability of a given expansion given the LHS non-terminal.
– Read this as P(specific rule | LHS)
– VP → V [.55]
– What's the probability that VP will expand to V, given that we have a VP?
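As a concrete illustration, the D function can be stored as a mapping from rules to probabilities. A minimal Python sketch (the VP → V NP rule and its probability are assumed here, chosen so that each LHS sums to 1):

    # A toy PCFG: D maps each rule (LHS, RHS) to P(rule | LHS).
    # The VP -> V NP probability is assumed, not from the slides.
    pcfg = {
        ("S",  ("NP", "VP")):        0.80,
        ("S",  ("Aux", "NP", "VP")): 0.15,
        ("S",  ("VP",)):             0.05,
        ("VP", ("V",)):              0.55,
        ("VP", ("V", "NP")):         0.45,
    }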
Example PCFG
[Figure: a full example PCFG, i.e. a set of rules annotated with probabilities]
Example PCFG Fragment
• S NP VP [.80]
S Aux NP VP [.15]
S VP [.05]
• Sum of conditional probabilities for a given
ANT = 1
• PCFG can be used to estimate probabilities
of each parse-tree for sentence S.
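A quick sanity check of the sums-to-1 constraint, assuming the pcfg dictionary sketched earlier:

    from collections import defaultdict

    # For each LHS non-terminal, its expansion probabilities must sum to 1.
    totals = defaultdict(float)
    for (lhs, rhs), prob in pcfg.items():
        totals[lhs] += prob
    for lhs, total in totals.items():
        assert abs(total - 1.0) < 1e-9, f"{lhs} expansions sum to {total}"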
Probability of a Derivation
• For sentence S, the probability assigned by a PCFG to a parse tree T is given by
P(T) = ∏ P(r(n)), taken over all nodes n in T
• i.e. the product of the probabilities of all the rules r used to expand each node n in T
• Note the independence assumption
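A sketch of this computation (the tree encoding is assumed: (label, children) tuples with plain strings as words; lexical rules A → w must also be present in the grammar):

    def tree_prob(tree, pcfg):
        """Product of P(rule | LHS) over every node in the tree,
        per the independence assumption."""
        label, children = tree
        # The RHS is the sequence of child labels (or words for leaves).
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = pcfg[(label, rhs)]
        for child in children:
            if not isinstance(child, str):
                p *= tree_prob(child, pcfg)
        return p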
Ambiguous Sentence
[Figure: two parse trees, T_L and T_R, for an ambiguous sentence]
P(T_L) = 1.5 × 10⁻⁶
P(T_R) = 1.7 × 10⁻⁶
P(S) = P(T_L) + P(T_R) = 3.2 × 10⁻⁶
Getting the Probabilities
• From an annotated database
– E.g. the Penn Treebank
– To get the probability for a particular VP rule, just count all the times the rule is used and divide by the number of VPs overall
– What if you have no treebank (e.g. for a 'new' language)?
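A sketch of that maximum-likelihood estimate, assuming the treebank has been flattened into a list of (LHS, RHS) rule uses:

    from collections import Counter

    def estimate_rule_probs(rule_uses):
        """MLE: P(A -> beta | A) = Count(A -> beta) / Count(A)."""
        rule_counts = Counter(rule_uses)                  # each (lhs, rhs)
        lhs_counts = Counter(lhs for lhs, _ in rule_uses)
        return {(lhs, rhs): c / lhs_counts[lhs]
                for (lhs, rhs), c in rule_counts.items()}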
The Parsing Problem for PCFGs
• The parsing problem for PCFGs is to produce the most likely parse for a given sentence, i.e. to compute the tree T spanning the sentence whose probability is maximal.
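Written as a formula (notation assumed here, not from the slides):

    T̂(S) = argmax P(T), taken over all parse trees T that span S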
Typical Approach
• Bottom-up (CYK) dynamic programming approach
• Assign probabilities to constituents as they are completed and placed in the table
• Use the max probability for each constituent going up the tree
• The CYK algorithm assumes that the grammar is in Chomsky Normal Form:
– No ε-productions
– Rules of the form A → B C or A → a
CKY Algorithm – Base Case
• Base case: covering input strings of length 1 (i.e. individual words). In CNF, the probability p has to come from that of the corresponding rule
• A → w [p]
CKY Algorithm – Recursive Case
• Recursive case: input strings of length > 1:
A ⇒* w_ij if and only if there is a rule A → B C and some k, i < k < j, such that B derives w_ik and C derives w_kj
• In this case P(w_ij) is obtained by multiplying together P(A → B C), P(w_ik) and P(w_kj)
• These probabilities are in other parts of the table
• Take the max value
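A compact sketch of both cases (the grammar format is assumed: lexical rules A → w and binary rules A → B C stored as dictionaries; illustrative, not an optimized parser):

    def pcky(words, lexical, binary):
        # lexical: {(A, w): p}       for rules A -> w [p]
        # binary:  {(A, (B, C)): p}  for rules A -> B C [p]
        # best[i][j] maps each non-terminal A to the max probability
        # that A derives words[i:j].
        n = len(words)
        best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
        # Base case: spans of length 1 come straight from lexical rules.
        for i, w in enumerate(words):
            for (A, word), p in lexical.items():
                if word == w and p > best[i][i + 1].get(A, 0.0):
                    best[i][i + 1][A] = p
        # Recursive case: for each longer span, try every split point k
        # and every binary rule, keeping the max of P(A->BC)*P(B)*P(C).
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for (A, (B, C)), p in binary.items():
                        cand = p * best[i][k].get(B, 0.0) * best[k][j].get(C, 0.0)
                        if cand > best[i][j].get(A, 0.0):
                            best[i][j][A] = cand
        return best

best[0][n].get("S", 0.0) is then the probability of the best parse; storing back-pointers alongside the probabilities would recover the tree itself.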
Problems with PCFGs
• Fundamental independence assumption: a PCFG assumes that the expansion of any one non-terminal is independent of the expansion of any other non-terminal.
• Hence rule probabilities can simply be multiplied together.
Problems with PCFGs (cont.)
• Doesn't use the words in any real way – e.g. PP attachment often depends on the verb, its object, and the preposition (I ate pickles with a fork. I ate pickles with relish.)
• Doesn't take into account where in the derivation a rule is used – e.g. pronouns are more often subjects than objects (She hates Mary. Mary hates her.)
Dependencies cannot be stated
• These dependencies could be captured if it were possible to say that the probabilities associated with, e.g.
NP → Pronoun or
NP → Det Noun
depend on whether the NP is a subject or an object
• However, this cannot normally be said in a standard PCFG.
Lack of Sensitivity to the Properties of Individual Words
• Lexical information can play an important role in selecting between alternative interpretations. Consider the sentence:
• "Moscow sent soldiers into Afghanistan."
• NP → NP PP
VP → VP PP
• These give rise to two parse trees
PP Attachment Ambiguity
33% of PPs attach to VPs
67% of PPs attach to NPs
[Figure: two parse trees for "Moscow sent soldiers into Afghanistan" – one attaches the PP "into Afghanistan" to the VP, the other attaches it to the object NP "soldiers"]
Lexical Properties
• In this case the raw statistics are misleading and yield the wrong conclusion.
• The correct parse should be decided on the basis of the lexical properties of the verb "send into" alone, since we know that the basic pattern for this verb is
(NP) send (NP) (PP[into])
where the PP[into] attaches to the VP.
Lexicalised PCFGs
• Add lexical dependencies to the scheme…
– Add the predilections of particular words into the probabilities in the derivation
– i.e. condition the rule probabilities on the actual words
• Basic idea: each syntactic constituent is associated with a head, which is a single word.
• Each non-terminal in a parse tree is annotated with that single word.
Phrasal Heads
– Heads
• The head of an NP is its noun
• The head of a VP is its verb
• The head of a PP is its preposition
– Can ‘take the place of’ whole phrases, in some sense
– Define most important characteristics of the phrase
– Phrases are generally identified by their heads
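A toy head-finding sketch (the head table and tree encoding are assumed; real head-finding rules for treebank parsing are considerably more detailed):

    # Toy mapping from a phrase type to the category of its head child.
    HEAD_CHILD = {"S": "VP", "NP": "N", "VP": "V", "PP": "P"}

    def head_word(tree):
        # Trees are (label, children); leaves are plain word strings.
        label, children = tree
        for child in children:
            if isinstance(child, str):
                return child                      # preterminal: the word itself
            if child[0] == HEAD_CHILD.get(label):
                return head_word(child)           # descend into the head child
        return head_word(children[-1])            # fallback: rightmost child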
Lexicalised Tree
[Figure: a parse tree in which every non-terminal is annotated with its head word]
Example (wrong)
[Figure: a lexicalised tree for the incorrect parse]
How?
• We started with rule probabilities
– VP → V NP PP with probability P(rule | VP)
• E.g., the count of this rule divided by the number of VPs in a treebank
• Now we want lexicalized probabilities
– VP(dumped) → V(dumped) NP(sacks) PP(into)
– P(r | VP ^ dumped is the verb ^ sacks is the head of the NP ^ into is the head of the PP)
• Not likely to have significant counts in any treebank
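To make the sparsity concrete, a sketch of the fully lexicalized estimate (the event format is assumed; in practice the joint count is almost always 0 or 1, so the estimate is unusable):

    from collections import Counter

    def lexicalized_mle(events, rule, heads):
        # events: list of (rule, heads) observations from a treebank,
        # where heads is e.g. ("dumped", "sacks", "into").
        joint = Counter(events)                    # (rule, heads) together
        context = Counter(h for _, h in events)    # the heads alone
        n = context[heads]
        return joint[(rule, heads)] / n if n else 0.0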
Probabilities of Lexicalised Rules
• Need to establish probabilities of, e.g.
VP(dumped) → V(dumped) NP(sacks) PP(into)
VP(dumped) → V(dumped) NP(cats) PP(into)
VP(dumped) → V(dumped) NP(hats) PP(into)
VP(dumped) → V(dumped) NP(sacks) PP(above)
• Problem – no corpus is big enough to train this number of rules
Declare Independence
• So, exploit the independence assumption and collect the statistics you can…
• Focus on capturing two things
– Verb subcategorization
• Particular verbs have affinities for particular VP rules
– Objects have affinities for their predicates (mostly their mothers and grandmothers)
• Some objects fit better with some predicates than others
Verb Subcategorization
• Condition particular VP rules on their head… so
r: VP → V NP PP with P(r | VP)
becomes
P(r | VP ^ dumped)
• What's the count? The number of times this rule was used with dumped, divided by the total number of VPs in which dumped appears
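A sketch of that estimate (the count tables are assumed to have been extracted from a lexicalised treebank):

    def p_rule_given_head(rule, head, rule_head_count, head_vp_count):
        # rule_head_count[(rule, head)]: times `rule` expanded a VP
        # headed by `head`; head_vp_count[head]: all VPs headed by `head`.
        n = head_vp_count.get(head, 0)
        return rule_head_count.get((rule, head), 0) / n if n else 0.0

For example, p_rule_given_head("VP -> V NP PP", "dumped", ...) estimates P(r | VP ^ dumped).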
Preferences
• The issue here is the attachment of the PP
• So the affinities we care about are the ones between dumped and into vs. sacks and into.
– Count the times dumped is the head of a constituent that has a PP daughter with into as its head, and normalize
– vs. the situation where sacks is the head of a constituent that has a PP daughter with into as its head
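A sketch of the affinity comparison (the count tables are assumed):

    def pp_attach_score(head, prep, head_pp_count, head_count):
        # head_pp_count[(head, prep)]: times `head` heads a constituent
        # with a PP daughter headed by `prep`; head_count[head]: times
        # `head` heads a constituent at all.
        n = head_count.get(head, 0)
        return head_pp_count.get((head, prep), 0) / n if n else 0.0

    # Comparing pp_attach_score("dumped", "into", ...) against
    # pp_attach_score("sacks", "into", ...) decides the attachment.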