22-09-05

An introduction to computational psycholinguistics:
Modeling human sentence processing
Shravan Vasishth
University of Potsdam, Germany
http://www.ling.uni-potsdam.de/~vasishth
[email protected]
September 2005, Bochum
Probabilistic models: (Crocker & Keller, 2005)
• In ambiguous sentences, a preferred interpretation is immediately assigned, with later
backtracking to reanalyze.
(1)
a. The horse raced past the barn fell. (Bever 1970)
b. After the student moved the chair broke.
c. The daughter of the colonel who was standing by the window.
• On what basis do humans choose one over the other interpretation?
• One plausible answer: experience. (Can you think of some others?)
• We’ve seen an instance of how experience could determine parsing decisions:
connectionist models.
The role of linguistic experience
• Experience: the number of times the speaker has encountered a particular entity in the
past.
• It’s impractical to measure or quantify experience of a particular entity based on the
entire set of linguistic items a speaker has ever encountered; but we can estimate it through,
e.g., corpora and norming studies (such as sentence completion). Consider the S/NP ambiguity
of the verb “know” (see the sketch below):
(2) The teacher knew . . .
• There is a reliable correlation between corpora and norming studies (Lapata, Keller, &
Schulte im Walde, 2001).
• The critical issue is how the human processor uses this experience to resolve ambiguities,
and at what level of granularity experience plays a role (lexical, syntactic structure,
verb frames).
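A minimal sketch (in Python) of how a norming-style estimate of a verb’s frame bias could be computed; the coded completions below are invented purely for illustration.

from collections import Counter

# Hypothetical hand-coded responses from a sentence-completion study for
# "The teacher knew ...": "S" = sentential complement, "NP" = direct object.
completions = ["NP", "S", "NP", "NP", "S", "NP", "S", "NP", "NP", "NP"]

counts = Counter(completions)
total = sum(counts.values())
for frame, n in counts.most_common():
    print(frame, n / total)   # relative-frequency estimate of the frame bias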
The granularity issue
• It’s clear that lexical frequencies play a role. But are frequencies used at the lemma
level or the token level? (Roland and Jurafsky 2002)
• Structural frequencies: do we use frequencies of individual phrase structure rules?
Probabilistic modelers say: yes.
Probabilistic grammars
• Context-free grammar rules:
S -> NP VP
NP -> Det N
• Probabilities associated with each rule, derived from a (treebank) corpus:
1.0 S -> NP VP
0.7 NP -> Det N
0.3 NP -> Nplural
• A normalization constraint on the PCFG: the probabilities of all rules with the same
LHS must sum to 1. (See appendix A of Hale 2003).
∀i   \sum_{j} P(N^i → ζ^j) = 1    (1)
• The probability of a parse tree is the product of the rule probabilities.
P(t) = \prod_{(N → ζ) ∈ R} P(N → ζ)    (2)

where R is the set of rule applications in the derivation of t.
• Jurafsky (1996) has suggested that the probability of a grammar rule models the ease
with which the rule can be accessed by the human sentence processor.
• Example from Crocker and Keller (2005) shows how this setup can be used to predict
parse preferences.
• Further reading: Manning and Schütze, and Jurafsky and Martin.
• NLTK demo.
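A minimal sketch of the kind of NLTK demo meant here, assuming a recent NLTK installation; the toy grammar and its probabilities are invented, not taken from a treebank.

import nltk

# A toy PCFG; rule probabilities (in brackets) are invented and sum to 1 per LHS.
grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.7]
    NP -> Det N PP [0.3]
    VP -> V NP [0.8]
    VP -> V NP PP [0.2]
    PP -> P NP [1.0]
    Det -> 'the' [1.0]
    N -> 'teacher' [0.4] | 'answer' [0.3] | 'student' [0.3]
    V -> 'knew' [1.0]
    P -> 'about' [1.0]
""")

parser = nltk.ViterbiParser(grammar)        # finds the most probable parse
for tree in parser.parse("the teacher knew the answer".split()):
    print(tree)
    print(tree.prob())                      # product of the rule probabilities used

With an ambiguous input (e.g., one where a PP could attach to the NP or the VP), the same parser would prefer whichever analysis the rule probabilities favour, which is how parse preferences of the kind discussed by Crocker and Keller (2005) can be predicted.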
Estimating the rule probabilities and parsing
• Maximum likelihood estimation: the probability of a rule is estimated as the number of
times it occurs in a treebank corpus, divided by the number of occurrences of its left-hand
side (see the sketch below).
• Expectation maximization: given a grammar and a corpus of unparsed sentences, iteratively
computes a set of rule probabilities that makes those sentences maximally likely.
• The Viterbi algorithm computes the most probable parse.
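A minimal sketch of the relative-frequency (maximum-likelihood) estimate, using a tiny hand-written “treebank”; the two trees are invented for illustration.

from collections import Counter
from nltk import Tree

treebank = [
    Tree.fromstring("(S (NP (Det the) (N teacher)) (VP (V knew) (NP (Det the) (N answer))))"),
    Tree.fromstring("(S (NP (Nplural teachers)) (VP (V slept)))"),
]

rule_counts, lhs_counts = Counter(), Counter()
for tree in treebank:
    for prod in tree.productions():          # the CFG rules used in this parse
        rule_counts[prod] += 1
        lhs_counts[prod.lhs()] += 1

# P(rule) = count(rule) / count(rule's left-hand side)
for prod, n in rule_counts.items():
    print(prod, n / lhs_counts[prod.lhs()])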
Linking probabilities to processing difficulty
• The goal is usually to model reading times or acceptability judgments.
• Possible measures:
– probability ratios of the alternative analyses
– entropy reduction during incremental parsing: a very different approach from computing
the probabilities of individual parses.
Hale (2003):
“cognitive load is related, perhaps linearly, to the reduction in the perceiver’s uncertainty
about what the producer meant.” (p. 102)
Some assumptions in Hale’s approach
• During comprehension, sentence understanders determine a syntactic structure for the
perceived signal.
• Producer and comprehender share the same grammar.
• Comprehension is eager; no processing is deferred beyond the first point at which it
could happen.
We’re now going to look at the mechanism he builds up, starting with the incredibly
beautiful notion of entropy. In the following discussion I rely on Schneider (2005).
An introduction to entropy
Imagine a device D that can emit three symbols A, B, C.
• Before D can emit anything, we are uncertain about which symbol among the three
possible ones it will emit. We can quantify this uncertainty and say it’s 3.
• Now a symbol appears, say, A. Our uncertainty decreases. To 2. In other words, we’ve
received some information.
• Information is decrease in uncertainty.
• Now suppose that another machine, D′, emits 1 or 2.
• The composition D × D′ results in 6 possible emissions: A1, A2, B1, B2, C1, C2.
So what’s our uncertainty now?
• It would be nice to be able to talk about increases in uncertainty additively; hence the
use of logarithms.
• For D, log(3) is the new uncertainty; for D′, log(2); for D × D′, log(6). When we use
base 2, the units of uncertainty are bits; base 10 (units: digits) and base e (units: nats
or nits) are also possible.
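A quick numeric check of the additivity claim (Python, base-2 logs):

import math

print(math.log2(3))                  # device D: three possible symbols
print(math.log2(2))                  # device D': two possible symbols
print(math.log2(3) + math.log2(2))   # uncertainties add for the composite ...
print(math.log2(6))                  # ... and equal log2 of the 6 joint outcomes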
Question
If a device emits only one symbol, what’s the uncertainty? How many bits?
Deriving the formula for uncertainty
Let M be the number of possible symbol emissions. Uncertainty is now log(M).
(From now on, log means log to the base 2.)

\log M = −\log(M^{−1})    (3)
       = −\log(1/M)    (4)
       = −\log(P)    (5)

. . . [letting P = 1/M]    (6)

Let P_i be the various probabilities of the symbols, such that

\sum_{i=1}^{M} P_i = 1    (7)
Surprise
The surprise we get when we see the ith symbol is called “surprisal”.
u_i = −\log(P_i)    (8)

If P_i = 0, then u_i = ∞ . . . we are very surprised.
If P_i = 1, then u_i = 0 . . . we are not surprised.

Average surprisal for an infinite string of symbols:
Assume a string of length N over M symbols, and let the ith symbol appear N_i times:

N = \sum_{i=1}^{M} N_i    (9)
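A one-function sketch of surprisal in bits; the probability values below are made up.

import math

def surprisal(p):
    """u_i = -log2(P_i), in bits."""
    return float("inf") if p == 0 else -math.log2(p)

print(surprisal(1.0))    # 0 bits: a certain symbol is not surprising
print(surprisal(0.125))  # 3 bits
print(surprisal(0.0))    # inf: an "impossible" symbol would be infinitely surprising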
Since there are N_i cases of u_i, the average surprisal is:

\frac{\sum_{i=1}^{M} N_i u_i}{\sum_{i=1}^{M} N_i} = \frac{1}{N} \sum_{i=1}^{M} N_i u_i    (10)

If we measure this for an infinite string of symbols, the relative frequency N_i / N
approaches the probability P_i:

H = \sum_{i=1}^{M} P_i u_i    (11)
  = −\sum_{i=1}^{M} P_i \log P_i    (12)
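The formula translates directly into code; a minimal sketch in Python (the two example distributions are invented):

import math

def entropy(probs):
    """Shannon entropy H = -sum_i P_i * log2(P_i), in bits (0 log 0 treated as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/3, 1/3, 1/3]))    # device D with equiprobable A, B, C: log2(3) ~ 1.585
print(entropy([0.5, 0.25, 0.25]))  # a skewed device is less uncertain: 1.5 bits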
That is Shannon’s formula for uncertainty. Search on Google for Claude Shannon, and
you will find the original paper. Read it and memorize it.
Exercise
Suppose P_1 = P_2 = \cdots = P_M.

H = −\sum_{i=1}^{M} P_i \log P_i = ?    (14)
Solution
Suppose P_1 = P_2 = \cdots = P_M. Then P_i = 1/M for i = 1, \ldots, M.

H = −\sum_{i=1}^{M} P_i \log P_i = −\left[ \frac{1}{M}\log\frac{1}{M} + \cdots + \frac{1}{M}\log\frac{1}{M} \right]    (15)
  = −\frac{1}{M} × M \log\frac{1}{M}    (16)
  = \log M    (17)

Recall our earlier idea that if we have M outcomes we can express our uncertainty as
log(M). Device D was log(3), for example.
Another exercise
Let p = , p = , · · · , p = . What is the average uncertainty or entropy H?
Hale’s approach
• Assume a grammar G.
• Let T_G represent all the possible derivations of G; each derivation has a probability.
• Let W be all the possible strings in G.
Let’s represent the information conveyed by the first i words of a sentence generated
by G by I(T_G | W_{1···i}):

I(T_G | W_{1···i}) = H(T_G) − H(T_G | W_{1···i})    (18)
Hale’s approach
I(T_G | W_{1···i}) = H(T_G) − H(T_G | W_{1···i})    (19)

The above is just like our example with device D:

I(D | A) = H(D) − H(D | A)    (20)
         = 3 − 2    (21)
         = 1    (22)

The above example with D matches our intuition: once an A has been emitted, our
uncertainty has been reduced by 1.
Information conveyed by a particular word
I(T_G | W_i = w_i) = H(T_G | w_{1···i−1}) − H(T_G | w_{1···i})    (23)
This gives us the information a comprehender gets from a word.
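A toy numeric illustration of equation (23); the distributions over analyses are invented.

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical (renormalized) distributions over the analyses still compatible
# with the input, before and after word i.
before_word_i = [0.25, 0.25, 0.25, 0.25]   # H = 2 bits
after_word_i  = [0.5, 0.5]                 # word i ruled out two analyses; H = 1 bit

print(entropy(before_word_i) - entropy(after_word_i))   # information conveyed: 1 bit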
The entropy of a grammar symbol, say VP,
is the sum of
• the entropy of a single-rule rewrite decision
• the expected entropy of any children of that symbol
(Grenander 1967)
The entropy of a grammar symbol
Let the set of productions of a PCFG G be Π. For a left-hand side ξ, the rules rewriting
it are Π(ξ). Example:
VP -> V NP
VP -> V
Let h be a vector indexed by the symbols ξ_i, so that h(ξ_1) refers to, say, the first cell
of the vector. For any grammar symbol we can compute the entropy using the usual formula:

h(ξ_1) = −\sum_{r ∈ Π(ξ_1)} p_r \log p_r    (24)
The entropy of a grammar symbol
More generally:
h(ξ_i) = −\sum_{r ∈ Π(ξ_i)} p_r \log p_r    (25)

So now the vector h contains the entropy of each ξ_i.
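For the NP rules of the earlier toy grammar (NP -> Det N [0.7], NP -> Nplural [0.3]), h(NP) can be computed directly:

import math

# Probabilities of the rules rewriting NP, taken from the toy PCFG above
rule_probs = [0.7, 0.3]

h_NP = -sum(p * math.log2(p) for p in rule_probs)
print(round(h_NP, 3))   # ~0.881 bits of uncertainty in the single rewrite decision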
The entropy of a grammar symbol, say VP,
is the sum of
• the entropy of a single-rule rewrite decision ✔
• the expected entropy of any children of that symbol
Now we work on the second part.
The expected entropy of any children of a symbol
Suppose rule r rewrites nonterminal ξ_i as n daughters. Example:
Rule r1: VP -> V NP
Rule r2: VP -> V
Here ξ_1 = VP and Π(VP) = {r1, r2}, and the daughters’ entropies are weighted by the
probability of the rule that introduces them:

H(VP) = h(VP) + p_{r1}[H(V) + H(NP)] + p_{r2}[H(V)]    (26)
The expected entropy of any children of a symbol
More generally:
H(ξ_i) = h(ξ_i) + \sum_{r ∈ Π(ξ_i)} p_r \left[ H(ξ_{r1}) + \cdots + H(ξ_{rn}) \right]    (27)

where ξ_{r1}, . . . , ξ_{rn} are the daughters introduced by rule r.
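A minimal sketch of the whole recursion under the reconstruction above, with an invented toy PCFG (the VP probabilities are assumptions); H is obtained by fixed-point iteration, which converges for a well-behaved grammar.

import math

# Toy PCFG (probabilities invented): nonterminal -> [(right-hand side, probability)].
# Symbols that are not keys are treated as (pre)terminals with entropy 0.
PCFG = {
    "VP": [(("V", "NP"), 0.6), (("V",), 0.4)],
    "NP": [(("Det", "N"), 0.7), (("Nplural",), 0.3)],
}

def h(rules):
    """h(xi): entropy of the single-rule rewrite decision."""
    return -sum(p * math.log2(p) for _, p in rules if p > 0)

def symbol_entropies(grammar, iterations=50):
    """H(xi) = h(xi) + sum over rules r of p_r * sum of H(r's daughters)."""
    H = {nt: 0.0 for nt in grammar}
    for _ in range(iterations):
        for nt, rules in grammar.items():
            H[nt] = h(rules) + sum(
                p * sum(H.get(d, 0.0) for d in rhs) for rhs, p in rules
            )
    return H

print(symbol_entropies(PCFG))   # e.g. H(NP) = h(NP); H(VP) = h(VP) + 0.6 * H(NP)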
The information conveyed by a word
I(T_G | W_i = w_i) = H(T_G | w_{1···i−1}) − H(T_G | w_{1···i})    (28)
Example
The horse raced past the barn fell
Initial conditional entropy of the parser state: H_G(S) = …
Input “the”:
Every sentence in the grammar begins with “the”, so no information is gained and there
is no reduction in entropy.
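To make this step concrete, here is a small sketch that enumerates every derivation of an invented, non-recursive toy PCFG (not Hale’s actual grammar), computes the entropy over derivations before any input, and then conditions on the first word being “the”; because every derivation starts with “the”, the reduction is 0, as stated above.

import math
from itertools import product

# Invented toy PCFG: nonterminal -> [(right-hand side, probability)].
# Symbols that are not keys are terminal words.
PCFG = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("Det", "N"), 1.0)],
    "VP":  [(("V",), 0.5), (("V", "NP"), 0.5)],
    "Det": [(("the",), 1.0)],
    "N":   [(("horse",), 0.5), (("barn",), 0.5)],
    "V":   [(("fell",), 0.5), (("raced",), 0.5)],
}

def derivations(symbol):
    """Yield (terminal yield, probability) for every complete derivation of symbol."""
    if symbol not in PCFG:                       # terminal word
        yield (symbol,), 1.0
        return
    for rhs, p in PCFG[symbol]:
        for combo in product(*(list(derivations(d)) for d in rhs)):
            words = tuple(w for ws, _ in combo for w in ws)
            prob = p
            for _, q in combo:
                prob *= q
            yield words, prob

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

trees = list(derivations("S"))
H_before = entropy([p for _, p in trees])

# Condition on the first word being "the" and renormalize.
kept = [(ws, p) for ws, p in trees if ws[0] == "the"]
Z = sum(p for _, p in kept)
H_after = entropy([p / Z for _, p in kept])

print(H_before, H_after, H_before - H_after)     # reduction is 0 bits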
The horse raced
The horse raced past the barn fell
Subject relatives
Object relatives
In closing
• What are some of the similarities and differences between connectionist and probabilistic
models?
• How would one (could one?) reconcile symbolic and connectionist and/or probabilistic
approaches?
References
Crocker, M., & Keller, F. (2005). Probabilistic grammars as models of gradience in
language processing. In G. Fanselow, C. Féry, R. Vogel, & M. Schlesewsky (Eds.),
Gradience in grammar: Generative perspectives. Oxford University Press.
Hale, J. (2003). The information conveyed by words in sentences. Journal of
Psycholinguistic Research, 32(2), 101–123.
Jurafsky, D. (1996). A probabilistic model of lexical and syntactic access and
disambiguation. Cognitive Science, 20, 137–194.
Lapata, M., Keller, F., & Schulte im Walde, S. (2001). Verb frame frequency as a predictor
of verb bias. Journal of Psycholinguistic Research, 30(4), 419–435.
Schneider, T. (2005, March). Information theory primer. Unpublished manuscript, available online.