An introduction to computational psycholinguistics: Modeling human sentence processing

Shravan Vasishth
University of Potsdam, Germany
http://www.ling.uni-potsdam.de/∼vasishth
[email protected]
September 2005, Bochum

Probabilistic models (Crocker & Keller, 2005)

• In ambiguous sentences, a preferred interpretation is immediately assigned, with later backtracking to reanalyze.

(1) a. The horse raced past the barn fell. (Bever, 1970)
    b. After the student moved the chair broke.
    c. The daughter of the colonel who was standing by the window.

• On what basis do humans choose one interpretation over the other?
• One plausible answer: experience. (Can you think of some others?)
• We have already seen an instance of how experience could determine parsing decisions: connectionist models.

The role of linguistic experience

• Experience: the number of times the speaker has encountered a particular entity in the past.
• It is impractical to measure or quantify experience of a particular entity over the entire set of linguistic items a speaker has seen, but we can estimate it through, e.g., corpora and norming studies (e.g., sentence completion). Consider the S/NP ambiguity of the verb "know":

(2) The teacher knew . . .

• There is a reliable correlation between corpora and norming studies (Lapata, Keller, & Schulte im Walde, 2001).
• The critical issue is how the human processor uses this experience to resolve ambiguities, and at what level of granularity experience plays a role (lexical items, syntactic structures, verb frames).

The granularity issue

• It is clear that lexical frequencies play a role. But are frequencies used at the lemma level or at the token level? (Roland & Jurafsky, 2002)
• Structural frequencies: do we use frequencies of individual phrase structure rules? Probabilistic modelers say: yes.

Probabilistic grammars

• Context-free grammar rules:

S -> NP VP
NP -> Det N

• Probabilities associated with each rule, derived from a (treebank) corpus:

1.0 S -> NP VP
0.7 NP -> Det N
0.3 NP -> Nplural

• A normalization constraint on the PCFG: the probabilities of all rules with the same left-hand side must sum to 1 (see appendix A of Hale, 2003):

∀i  Σ_j P(N^i → ζ^j) = 1    (1)

• The probability of a parse tree t is the product of the probabilities of the rules used in its derivation:

P(t) = ∏_{(N→ζ) ∈ R} P(N → ζ)    (2)

where R is the multiset of rule applications in the derivation of t.

• Jurafsky (1996) has suggested that the probability of a grammar rule models the ease with which the rule can be accessed by the human sentence processor.
• An example from Crocker and Keller (2005) shows how this setup can be used to predict parse preferences; a sketch of the same idea follows below.
• Further reading: Manning and Schütze; Jurafsky and Martin.
• NLTK demo.
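To make the arithmetic concrete, here is a minimal pure-Python sketch (separate from the NLTK demo) that applies equation (2) to a hypothetical grammar fragment for the reduced-relative ambiguity in (1a). All rule probabilities are invented for illustration; only their relative magnitudes matter.

from math import prod

# Invented rule probabilities for a toy PCFG (not from any treebank);
# the rules for each left-hand side sum to 1.
p = {
    "S -> NP VP":     1.0,
    "NP -> Det N":    0.7,
    "NP -> Det N PP": 0.2,
    "NP -> NP RC":    0.1,   # reduced-relative modification of an NP
    "VP -> V NP":     0.5,
    "VP -> V PP":     0.3,
    "VP -> V":        0.2,
    "RC -> VP":       1.0,   # simplification: a reduced relative is a bare VP
    "PP -> P NP":     1.0,
}

# Rules used (lexical rules omitted) in the two incremental analyses of
# "the horse raced past the barn":
main_verb = ["S -> NP VP", "NP -> Det N", "VP -> V PP",
             "PP -> P NP", "NP -> Det N"]
reduced_relative = ["S -> NP VP", "NP -> NP RC", "NP -> Det N", "RC -> VP",
                    "VP -> V PP", "PP -> P NP", "NP -> Det N"]

for name, rules in [("main verb", main_verb),
                    ("reduced relative", reduced_relative)]:
    # Equation (2): P(t) is the product of the probabilities of the rules in t.
    print(f"{name:16s} {prod(p[r] for r in rules):.4f}")

Under these made-up numbers the main-verb analysis of "the horse raced past the barn" is ten times more probable than the reduced-relative analysis, so a parser that ranks analyses by probability prefers it and is garden-pathed when "fell" arrives.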
Estimating the rule probabilities and parsing

• Maximum likelihood: estimates the probability of a rule from the number of times it occurs in a treebank corpus, relative to all rules with the same left-hand side.
• Expectation maximization: given a grammar and unparsed text, computes a set of rule probabilities that make the sentences maximally likely.
• Viterbi algorithm: computes the best (most probable) parse.

Linking probabilities to processing difficulty

• The goal is usually to model reading times and acceptability judgments.
• Possible measures:
  – probability ratios of the alternative analyses
  – entropy reduction during incremental parsing: a very different approach from computing probabilities of parses. Hale (2003): "cognitive load is related, perhaps linearly, to the reduction in the perceiver's uncertainty about what the producer meant." (p. 102)

Some assumptions in Hale's approach

• During comprehension, sentence understanders determine a syntactic structure for the perceived signal.
• Producer and comprehender share the same grammar.
• Comprehension is eager: no processing is deferred beyond the first point at which it could happen.

We are now going to look at the mechanism Hale builds up, starting with the incredibly beautiful notion of entropy. In the following discussion I rely on Schneider (2005).

An introduction to entropy

Imagine a device D1 that can emit three symbols: A, B, C.

• Before D1 emits anything, we are uncertain about which of the three possible symbols it will emit. We can quantify this uncertainty and say it is 3.
• Now a symbol appears, say A. Our uncertainty decreases, to 2. In other words, we have received some information.
• Information is a decrease in uncertainty.
• Now suppose that another machine, D2, emits the symbols 1 or 2.
• The composition D1 × D2 has 6 possible emissions: A1, A2, B1, B2, C1, C2. So what is our uncertainty now?
• It would be nice if uncertainty combined additively when devices are composed; hence the use of logarithms.
• D1: log(3) is the new (logarithmic) uncertainty; D2: log(2); D1 × D2: log(6) = log(3) + log(2). When we use base 2, the units of uncertainty are bits; base 10 (units: digits) and base e (units: nats/nits) are also possible.

Question

If a device emits only one symbol, what is the uncertainty? How many bits?

Deriving the formula for uncertainty

Let M be the number of possible symbols. Uncertainty is now log(M). (From now on, log means log to the base 2.)

log M = − log(M^−1) = − log(1/M)    (3)
      = − log(P)                    (4)

. . . letting P = 1/M, the probability of each (equally likely) symbol.

In the general case, let P_i be the probabilities of the individual symbols, such that

Σ_{i=1}^{M} P_i = 1    (7)

Surprise

The surprise we get when we see the i-th symbol is called "surprisal":

u_i = − log(P_i)    (8)

If P_i = 0 then u_i = ∞ . . . we are very surprised.
If P_i = 1 then u_i = 0 . . . we are not surprised.

Average surprisal for an infinite string of symbols: assume a string of length N over M symbols, and let the i-th symbol appear N_i times:

N = Σ_{i=1}^{M} N_i    (9)

Since there are N_i occurrences of u_i, the average surprisal is

(Σ_{i=1}^{M} N_i u_i) / (Σ_{i=1}^{M} N_i) = Σ_{i=1}^{M} (N_i / N) u_i    (10)

If we measure this for an infinite string of symbols, the relative frequency N_i / N approaches the probability P_i:

H = Σ_{i=1}^{M} P_i u_i    (11)
  = − Σ_{i=1}^{M} P_i log P_i    (12)

That is Shannon's formula for uncertainty. Search on Google for Claude Shannon, and you will find the original paper. Read it and memorize it.

Exercise

Suppose P_1 = P_2 = · · · = P_M.

H = − Σ_{i=1}^{M} P_i log P_i = ?    (14)

Solution

Suppose P_1 = P_2 = · · · = P_M. Then P_i = 1/M for i = 1, . . . , M.

H = − Σ_{i=1}^{M} P_i log P_i = − [ (1/M) log(1/M) + · · · + (1/M) log(1/M) ]    (15)
  = − M × (1/M) log(1/M)    (16)
  = log M    (17)

Recall our earlier idea that with M outcomes we can express our uncertainty as log(M); device D1, for example, had uncertainty log(3).

Another exercise

Let p_1 = …, p_2 = …, · · · , p_M = …. What is the average uncertainty or entropy H?

Hale's approach

• Assume a grammar G.
• Let T_G represent all the possible derivations of G; each derivation has a probability.
• Let W be all the possible strings in G. We represent the information conveyed by the first i words W_{1...i} of a sentence generated by G as I(T_G | W_{1...i}):

I(T_G | W_{1...i}) = H(T_G) − H(T_G | W_{1...i})    (18)

The above is just like our example with device D1:

I(D1 | A) = H(D1) − H(D1 | A)    (20)
          = 3 − 2                (21)
          = 1                    (22)

This matches our intuition: once an A has been emitted, our uncertainty has been reduced by 1.
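The following Python sketch spells out this definition on a toy scale. The "derivations" and their probabilities are invented, and the conditional entropy is computed by brute-force renormalization over an enumerated set of complete sentences, which is all the toy needs (Hale computes these quantities from the grammar itself rather than by enumeration).

from math import log2

def entropy(probs):
    # Shannon entropy in bits: H = - sum_i P_i log2 P_i (terms with P_i = 0 contribute 0).
    return -sum(p * log2(p) for p in probs if p > 0)

# Invented probabilities over the complete derivations T_G of a tiny grammar.
derivations = {
    ("the", "horse", "fell"):  0.4,
    ("the", "horse", "raced"): 0.3,
    ("a", "dog", "barked"):    0.2,
    ("a", "dog", "fell"):      0.1,
}

def conditioned_on(prefix):
    # Distribution over derivations consistent with the word prefix, renormalized.
    consistent = {d: p for d, p in derivations.items() if d[:len(prefix)] == prefix}
    total = sum(consistent.values())
    return {d: p / total for d, p in consistent.items()}

h_all = entropy(derivations.values())               # H(T_G)
h_the = entropy(conditioned_on(("the",)).values())  # H(T_G | W_1 = "the")
print(h_all, h_the, h_all - h_the)                  # information conveyed by "the"

With these numbers, hearing "the" rules out the "a ..." derivations and reduces the entropy from about 1.85 bits to about 0.99 bits; that is, the word conveys roughly 0.86 bits of information.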
Information conveyed by a particular word

I(T_G | W_i = w_i) = H(T_G | w_{1...i−1}) − H(T_G | w_{1...i})    (23)

This gives us the information a comprehender gets from an individual word.

The entropy of a grammar symbol

The entropy of a grammar symbol, say VP, is the sum of
• the entropy of a single-rule rewrite decision, and
• the expected entropy of the children of that symbol
(Grenander, 1967).

Let the set of productions in a PCFG G be Q. For a left-hand side ξ, the rules rewriting it are Q(ξ). Example:

VP -> V NP
VP -> V

Let h be a vector indexed by the symbols ξ_i, so that h(ξ_1), say, is the first cell in the vector. For the grammar symbol ξ_1 we can compute the entropy of the rewrite decision using the usual formula:

h(ξ_1) = − Σ_{r ∈ Q(ξ_1)} p_r log p_r    (24)

More generally,

h(ξ_i) = − Σ_{r ∈ Q(ξ_i)} p_r log p_r    (25)

so the vector h now holds the rewrite-decision entropies of all the ξ_i. That covers the first part of the decomposition above; now we work on the second part, the expected entropy of the children.

The expected entropy of the children of a symbol

Suppose rule r rewrites a nonterminal ξ_i as n daughters. For example:

Rule r1: VP -> V NP
Rule r2: VP -> V

Here ξ_1 = VP and Q(VP) = {r1, r2}, so

H(VP) = h(VP) + p_{r1} [H(V) + H(NP)] + p_{r2} [H(V)]    (26)

More generally,

H(ξ_i) = h(ξ_i) + Σ_{r ∈ Q(ξ_i)} p_r [H(ξ_{r,1}) + · · · + H(ξ_{r,n_r})]    (27)

where ξ_{r,1}, . . . , ξ_{r,n_r} are the daughters introduced by rule r.
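A small Python sketch of this recurrence follows, using a hypothetical toy grammar with invented rule probabilities. Because the toy grammar is not recursive, a direct recursion over the rules suffices; Hale's appendix handles recursive grammars by solving the corresponding system of linear equations instead.

from math import log2

# A toy, non-recursive PCFG: nonterminal -> list of (probability, right-hand side).
# Lower-case strings are terminals; the probabilities are invented for illustration.
grammar = {
    "S":   [(1.0, ["NP", "VP"])],
    "NP":  [(0.7, ["Det", "N"]), (0.3, ["N"])],
    "VP":  [(0.6, ["V", "NP"]), (0.4, ["V"])],
    "Det": [(1.0, ["the"])],
    "N":   [(0.5, ["horse"]), (0.5, ["barn"])],
    "V":   [(0.8, ["fell"]), (0.2, ["raced"])],
}

def h(sym):
    # Entropy of the single rewrite decision at sym: h(sym) = - sum_r p_r log2 p_r.
    return -sum(p * log2(p) for p, _ in grammar.get(sym, []))

def H(sym):
    # Total entropy of the subtree rooted in sym, following equation (27):
    # H(sym) = h(sym) + sum_r p_r * [H(first daughter) + ... + H(last daughter)].
    # The direct recursion terminates only because this grammar has no recursion.
    if sym not in grammar:   # terminal symbol: nothing left to decide
        return 0.0
    return h(sym) + sum(p * sum(H(d) for d in rhs) for p, rhs in grammar[sym])

for nonterminal in ["V", "N", "NP", "VP", "S"]:
    print(nonterminal, round(H(nonterminal), 3))

With these invented numbers, H(S) comes out to about 4.7 bits: the uncertainty a comprehender starts with before hearing any word of a sentence generated by this toy grammar.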
Example

The horse raced past the barn fell

Initial conditional entropy of the parser state: H_G(S) = . . .

Input "the": every sentence in the grammar begins with "the", so no information is gained and there is no reduction in entropy.

[Figures: "The horse raced"; "The horse raced past the barn fell"; "Subject relatives"; "Object relatives"]

In closing

• What are some of the similarities and differences between connectionist and probabilistic models?
• How would one (could one?) reconcile symbolic with connectionist and/or probabilistic approaches?

References

Crocker, M., & Keller, F. (2005). Probabilistic grammars as models of gradience in language processing. In G. Fanselow, C. Féry, R. Vogel, & M. Schlesewsky (Eds.), Gradience in grammar: Generative perspectives. Oxford University Press.

Hale, J. (2003). The information conveyed by words in sentences. Journal of Psycholinguistic Research, 32(2), 101–123.

Jurafsky, D. (1996). A probabilistic model of lexical and syntactic access and disambiguation. Cognitive Science, 20(2), 137–194.

Lapata, M., Keller, F., & Schulte im Walde, S. (2001). Verb frame frequency as a predictor of verb bias. Journal of Psycholinguistic Research, 30(4), 419–435.

Schneider, T. (2005, March). Information theory primer. Unpublished manuscript, available online.