Generation and lexical selection
Ann Copestake
Natural Language and Information Processing Group
Computer Laboratory, University of Cambridge
June 2008

Outline
• Generation
  • Overview: components of a generation system
  • Generation and parsing in constraint-based formalisms
  • Lexicalist generation
  • Realisation ranking
  • Lexical choice
• Lexical selection and collocation

Terminology

 Content in KB --(STRATEGIC GENERATION)--> LF --(TACTICAL GENERATION)--> string (plus markup)

STRATEGIC GENERATION: organizing the knowledge to be conveyed and constructing an LF that corresponds to a sentence.
TACTICAL GENERATION / REALIZATION: LF to string. In principle, independent of domain knowledge.

Tasks in generation
• Content determination/selection: deciding what information to convey (a small amount of recent work on statistical approaches).
• Document structuring.
• Aggregation: deciding how information may be split into sentence-sized chunks.
• Referring expression generation: deciding when to use pronouns, etc. (mostly limited domain).
• Lexical choice: deciding which lexical items to use to convey a given concept (mostly limited domain).
• Surface realization: mapping from a meaning representation for an individual sentence to a string (or speech output).

Properties of a grammar
Grammar: a grammar consists of a set of grammar rules G, a set of lexical entries L, and a start structure Q.
Lexical sign: a lexical sign is a pair ⟨L, S⟩ of a TFS L and a string list S.
Valid phrase: a valid phrase P is a pair ⟨F, S⟩ of a TFS F and a string list S such that:
1. P is a lexical sign, or
2. F is subsumed by some rule R, and there are valid phrases ⟨F1, S1⟩ ... ⟨Fn, Sn⟩ such that R's daughters subsume F1 ... Fn and S is the ordered concatenation of S1 ... Sn.
Sentences: a string list S is a well-formed sentence if there is a valid phrase ⟨F, S⟩ such that the start structure Q subsumes F.

Properties of parsing and generation
Parsing a string S consists of finding all valid phrases ⟨F1, S⟩ ... ⟨Fn, S⟩ such that the start structure Q subsumes each structure F1 to Fn.
Generating from a start structure Q′, which is equal to or subsumed by the general start structure Q, consists of finding all valid strings S1 ... Sn which correspond to valid signs ⟨F1, S1⟩ ... ⟨Fn, Sn⟩ such that the start structure Q′ subsumes each structure F1 to Fn.

Equivalent logical forms
We have to instantiate Q′ with an LF in a particular syntactic form. The logical form equivalence problem:
• Multiple LFs are logically equivalent.
• We can't tell which LF a grammar will accept.
• The LF equivalence problem is undecidable, even for FOPC.
Two-part solution:
1. Not all semantic equivalences should be treated as equivalent inputs for generation:
   ∀x[student′(x) ⇒ happy′(x)]
   ¬∃x[student′(x) ∧ ¬happy′(x)] ∧ (happy′(k) ∨ ¬happy′(k))
2. Allow for some variation via a flat semantic representation:
   2.1 [this(c) ∧ dog(c) ∧ chase(e, c, c′) ∧ the(c′) ∧ cat(c′)]
   2.2 [cat(c′) ∧ chase(e, c, c′) ∧ dog(c) ∧ the(c′) ∧ this(c)]
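To make point 2 concrete, here is a minimal Python sketch (not from the original slides) that treats a flat semantic representation as a bag of elementary predications, so that the two orderings in 2.1 and 2.2 count as the same generator input. The tuple encoding of an EP is an illustrative assumption, not the actual MRS data structure.

from collections import Counter

# An elementary predication (EP) is modelled here as a predicate name plus
# a tuple of variable names, e.g. ("chase", ("e", "c", "c2")).
def ep_bag(eps):
    """Order-insensitive multiset view of a flat LF."""
    return Counter(eps)

def same_flat_lf(lf1, lf2):
    """Two flat LFs count as the same generator input iff they contain
    the same EPs, regardless of the order of the conjuncts."""
    return ep_bag(lf1) == ep_bag(lf2)

lf_a = [("this", ("c",)), ("dog", ("c",)), ("chase", ("e", "c", "c2")),
        ("the", ("c2",)), ("cat", ("c2",))]
lf_b = [("cat", ("c2",)), ("chase", ("e", "c", "c2")), ("dog", ("c",)),
        ("the", ("c2",)), ("this", ("c",))]

assert same_flat_lf(lf_a, lf_b)   # conjunct order is irrelevant

This deliberately ignores scope: in real MRS the handle constraints would also have to be compatible.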
Naive lexicalist generation
1. From the LF, construct a bag of instantiated lexical signs.
2. List the signs in all possible orders.
3. Parse each order.
• Highly independent of syntax.
• Requires lexical entries to be recoverable from the LF.
• Not exactly efficient . . .
• Shake and Bake generation is part of an approach to MT in which transfer operates over instantiated lexical signs.
• Shake and Bake isn't as bad as the totally naive approach, but it is still worst-case exponential.

Lexical lookup for lexicalist generation
Input LF:
 a′(y), consultant′(y), german′(y), every′(x), manager′(x), interview′(e_past, x, y)
The instantiated lexical entry for interview contains interview′(e1, x1, y1) (e1, x1 and y1 constants).
Complications:
• Lexical rules: the past form of interview.
• Multiple lexical entries (cf lexical ambiguity).
• Multiple relations in a lexical entry, with possible overlaps, e.g., who: which_rel, person_rel.
• Lexical entries without relations (e.g., infinitival to).

Chart generation
Lexical signs are used to instantiate the chart; generation then proceeds as parsing.

 id  MRS                                                    string                     dtrs
 Lexical edges:
 1   a′(y1)                                                 a/an
 2   consultant′(y1)                                        consultant
 3   german′(y1)                                            German
 4   every′(x1)                                             every
 5   manager′(x1)                                           manager
 6   interview′(e1_past, x1, y1)                            interviewed
 Some of the edges constructed:
 12  a′(y1), consultant′(y1)                                a consultant               (1,2)
 18  every′(x1), manager′(x1)                               every manager              (4,5)
 22  interview′(e1_past, x1, y1), a′(y1), consultant′(y1)   interviewed a consultant   (6,12)
 24  german′(y1), consultant′(y1)                           german consultant          (3,2)

Chart generation, more details
1. Indexing can be done by semantic indices.
2. Indices are constants, so we don't get incorrect coindexation.
3. Daughters may not overlap (check overlap on the LF: MRS is good for this).
4. Still worst-case exponential (intersective modifiers).

Chart generation in the LKB
1. The algorithm above is enhanced to allow semantics to be contributed by rules.
2. MRS input makes construction of the bag of signs fairly easy.
3. Some tweaks for added efficiency.
4. With the added tweaks, there's actually no advantage in indexing by semantic indices.
5. Overgeneration is an issue with the LinGO ERG, mainly with respect to modifier order, e.g., big red box, ?red big box.
6. Stochastic ordering constraints: less well investigated than for parsing.
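The core of the chart-generation idea above can be sketched in a few lines of Python (an illustration, not the LKB implementation): each edge carries a coverage bitmask over the input bag of lexical signs, daughters may not overlap, and any edge that covers the whole bag is a complete realisation. The combine function stands in for grammar-rule application and unification, and is assumed to be supplied by the caller.

class Edge:
    def __init__(self, string, index, cover):
        self.string = string   # realisation so far
        self.index = index     # semantic index of the head
        self.cover = cover     # bitmask over the input bag of lexical signs

def chart_generate(lexical_edges, n_signs, combine):
    """Agenda-driven chart generation: pop an edge, try to combine it with
    every non-overlapping edge already in the chart, and collect edges that
    cover the whole input.  combine(left, right) should return the string
    for the combined phrase, or None if the grammar does not license it."""
    goal = (1 << n_signs) - 1
    chart, agenda, results = [], list(lexical_edges), []
    while agenda:
        new = agenda.pop()
        for old in chart:
            if new.cover & old.cover:          # daughters may not overlap
                continue
            for left, right in ((old, new), (new, old)):
                s = combine(left, right)
                if s is None:
                    continue
                edge = Edge(s, left.index, left.cover | right.cover)
                agenda.append(edge)
                if edge.cover == goal:
                    results.append(edge.string)
        chart.append(new)
    return results

With the six lexical edges in the table above and a combine function derived from the grammar, edges such as 12, 18, 22 and 24 are built in exactly this way; indexing edges by their semantic index (point 1 above) simply narrows the set of chart edges each new edge is compared against.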
Problems with purely symbolic generation
• Controlling realizations: cf ambiguity in parsing.
• Detailed grammars, very specific input.
• 'Grammaticality' vs fluency.
• Collocation (in the linguistic sense).
• Adjective ordering: big red triangle vs ?red big triangle.
• Heaviness: ?Kim gave the very important consultant it vs Kim gave it to the very important consultant.
• Information structure.
• Topicalization: e.g., Bananas, I like.

Statistical generation approaches
1. n-grams on a word lattice (Langkilde and Knight):
   • a shallow hand-written rewrite grammar generates a word lattice
   • concepts based on WordNet; lexical choice among elements of a synset
   • a bigram model for choosing between realisations
2. Train a bi-directional grammar on a realisation bank.
3. Model specific problems that the symbolic grammar doesn't deal with.
The techniques are not mutually exclusive.

Statistical generation model with an HPSG
• Erik Velldal: Velldal and Oepen (2006).
• n-gram models trained on the BNC.
• Symmetric treebanks:
  • standard Redwoods treebank: select an analysis (and thus a semantics) for items in the corpus
  • symmetric treebank: also record the other possible realisations for the given semantics
• Maximum entropy model trained on the treebank.
• Selective unpacking.
• n-grams used in addition.

Adjective ordering
• Logical representation does not determine order: wet(x) ∧ weather(x) ∧ cold(x).
• Constraints/preferences:
   big red car / *red big car
   cold wet weather / wet cold weather (OK, but dispreferred)
• Bigrams perform poorly (sparse data).
• Positional probability, i.e., the probability that an adjective is first/second in any pairing.
• Malouf (2000): memory-based learning plus positional probability.

Lexical choice in lexicalist generation
• The basic assumption is that an EP corresponds to a word. Null-semantics items are dealt with via 'trigger rules'.
• The grammar controls some lexical selection (see later), so the input to the generator shouldn't have to specify it.
• Other cases are partially conventional but not dealt with in the grammar. Determiner choice: I cut my face vs *I cut the face.
• Currently, generator input to the ERG has to be very precisely specified: unsuitable for many potential applications.

Determiner choice
 We went climbing in Andes
 president of United States
 I tore pyjamas
 I tore duvet
 George doesn't like vegetables
 We bought new car yesterday
cf Minnen et al.: a/the/no-determiner selection.

Lexical selection and collocation

Types of grammatical selection
• Syntactic: e.g., the preposition among selects for an NP (like other prepositions).
• Lexical: e.g., spend selects for a PP headed by on:
   Kim spent the money on a car
• Semantic, but conventionalised: e.g., temporal at selects for times of day (and meals):
   at 3am
   at three thirty five and ten seconds precisely

Lexical selection

 spend_v2 := v_np-pp_le &
  [ STEM < "spend" >,
    SYNSEM [ LKEYS [ -OCOMPKEY _on_p_rel,
                     KEYREL.PRED "_spend_v_1_rel" ] ] ]

• The ERG relies on the convention that different lexemes have different relations.
• 'Lexical' selection is actually semantic (cf Wechsler).
• Either a no-true-synonyms assumption, or assume that the grammar makes distinctions that are more fine-grained than real-world denotation justifies.
• Near-synonymy would have to be recorded elsewhere.

Semantic selection
In the ERG, specify a higher node in the hierarchy of relations:

 at_temp := p_np_i-tmp_le &
  [ STEM < "at" >,
    SYNSEM [ LKEYS [ -COMPKEY hour_or_time_rel,
                     KEYREL.PRED _at_p_temp_rel ] ] ]

• Semantic selection allows for an indefinitely large set of alternative phrases.
• Productive with respect to new words, but exceptions are allowable: the account is not falsified by, e.g., *at tiffin.
• ERG lexical selection is a special case of ERG semantic selection!
• There is also an idiom mechanism in the ERG.
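As a toy illustration of the selection mechanism (a Python sketch, not ERG machinery; the relation names below other than hour_or_time_rel are invented), semantic selection can be thought of as a subsumption check in a hierarchy of relations: temporal at licenses any complement whose key relation falls under the selected-for supertype, so new time expressions work productively, while idiosyncratic gaps like *at tiffin can still be listed.

# Toy hierarchy of relations: child -> parent (None = no parent).
HIERARCHY = {
    "hour_or_time_rel": None,
    "_3am_rel": "hour_or_time_rel",
    "_noon_rel": "hour_or_time_rel",
    "_lunch_rel": "hour_or_time_rel",    # meals count as times here
    "_tiffin_rel": "hour_or_time_rel",
    "_monday_rel": None,                 # days are not under the supertype
}

# Idiosyncratic exceptions, blocked even though the hierarchy licenses them.
EXCEPTIONS = {("at", "_tiffin_rel")}

def subsumes(supertype, rel):
    """True if rel is equal to, or a descendant of, supertype."""
    while rel is not None:
        if rel == supertype:
            return True
        rel = HIERARCHY.get(rel)
    return False

def selects(prep, selected_supertype, complement_rel):
    """Semantic selection: prep takes the complement iff the complement's
    key relation falls under the selected-for supertype and the pair is
    not listed as an exception."""
    if (prep, complement_rel) in EXCEPTIONS:
        return False
    return subsumes(selected_supertype, complement_rel)

assert selects("at", "hour_or_time_rel", "_3am_rel")          # at 3am
assert not selects("at", "hour_or_time_rel", "_monday_rel")   # temporal at does not take days
assert not selects("at", "hour_or_time_rel", "_tiffin_rel")   # *at tiffin, listed exception

Plain lexical selection (the spend/on case) is then just the degenerate case where the selected-for type has a single relation under it, which is the sense in which ERG lexical selection is a special case of semantic selection.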
Denotation and grammar engineering
• Denotation is truth-conditional, logically formalisable (in principle), and refers to the 'real world' (extension).
• It must interface with non-linguistic components.
• Minimising lexical complexity in broad-coverage grammars is practically necessary.
• Plausible input to the generator: it is reasonable to expect real-world constraints to be obeyed (except in context).

Denotation and grammar engineering
• Assume linkage to a domain, with a richer knowledge representation language available.
• The TFS language is for morphology, syntax and compositional semantics: it is not intended for general inference.
• Talmy example: the baguette lay across the road. For across, the Figure's length exceeds the Ground's width.
  • identifying the Figure and Ground and the location for comparison in the grammar?
  • coding the average length of all nouns?
  • allowing for massive baguettes and tiny roads?

But . . .
• KR currently assumes description logics rather than richer languages, so inference will be limited.
• We need to think about the denotation to justify grammaticization (or otherwise): if temporal in/on/at share a denotation and their distribution is given a selectional account, it is unreasonable to expect in/on/at in the generator input.
• Linguistic criteria: denotation versus grammaticization? Is the effect found cross-linguistically? Is it predictable on the basis of world knowledge? Closed class vs open class.
• Practical considerations about interfacing.
• Allow generalisation over, e.g., in/on/at in the generator input, while keeping the possibility of a distinction.

Collocations
• Intuition: two or more lexical items occurring together in some syntactic relationship more frequently than would be expected, even given world knowledge: e.g., shake and fist, but NOT buy and house.
• Anti-collocation: concentrated tea vs strong tea.
• Collocation or semantics? Is this even a testable concept?
   heavy smoker, heavy use, heavy consumption
   heavy weather, heavy sea, heavy breathing
   (compare with strong)

Collocation versus denotation
• Whether an unusually frequent word pair is a collocation or not depends on assumptions about denotation: fix the denotation in order to investigate collocation.
• Empirically: investigations using WordNet synsets (Pearce, 2001).
• Anti-collocation: words that might be expected to go together but tend not to, e.g., flawless behaviour (Cruse, 1986), big rain (unless explained by denotation).

Bake versus roast
 roast: beef, pork, chicken, head of lamb, camel, goat, !cow, lizard, crab, *ham, *gammon
 bake:  bread, cake, pies
 roast: potato, chestnuts, turnip, crab apples, ?apple
 bake:  ham, fish, snake, barracuda
 roast: steak
 bake:  apple, potato, Gratin Dauphinois, pork chops with apples
 roast: coffee beans
 bake:  clay, concrete, earth
 roast: duke
 bake:  pottery, resistance films
 roast: metal ore

rancid in the BNC
Most frequent co-occurrences (77 cases of rancid in 100 million words):
 fat 6, butter 5, oil 5, meat 4, pork 2, odour 3, smell 2
Possibly speaker-dependent status:
• Collocation: rancid normally occurs with oily things (or, for some people, dairy products, or . . . ) but just means 'off'.
• Denotation (technical use): rancid refers to a certain sort of 'off-ness' (associated with oxidized fat).
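As a rough illustration of "more frequently than would be expected" (not part of the original lecture), the Python sketch below scores word pairs by pointwise mutual information over corpus counts; the counts are invented placeholders, not real BNC figures.

import math
from collections import Counter

def pmi(pair_count, w1_count, w2_count, n):
    """Pointwise mutual information: log2( P(w1,w2) / (P(w1) P(w2)) ),
    estimated with simple relative frequencies over n observations."""
    return math.log2((pair_count / n) / ((w1_count / n) * (w2_count / n)))

# Invented toy counts over adjective-noun pairs (NOT real BNC frequencies).
pair_counts = Counter({("rancid", "butter"): 5, ("strong", "tea"): 40,
                       ("concentrated", "tea"): 1})
word_counts = Counter({"rancid": 77, "butter": 900, "strong": 8000,
                       "tea": 1200, "concentrated": 300})
N = 1_000_000  # total number of extracted pairs (also invented)

for (w1, w2), c in pair_counts.items():
    print(f"{w1} {w2}: PMI = {pmi(c, word_counts[w1], word_counts[w2], N):.2f}")

In practice a log-likelihood ratio or t-score is often preferred, since raw PMI overweights rare pairs; either way, anti-collocations such as concentrated tea show up as pairs scoring lower than their near-synonymous alternatives such as strong tea.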
Final summary: Data-driven and generative approaches
• Two paradigms:
  1. formal linguistics/generative grammar
  2. data-driven techniques
• Hypothesis: data-driven techniques work because they model some aspects of language that the classical approaches don't.
• Course has discussed some areas of semantics and pragmatics in computational linguistics where combined models seem promising.
• Many more questions than conclusions!