International Conference on Intelligent Computer Communication, C. Unger & A. Leţia (eds.), Casa Cărţii de Ştiinţă, Cluj-Napoca, 1995

LEARNING THE MEANING OF UNKNOWN WORDS

Dan Tufiş
Romanian Centre for Artificial Intelligence
13, "13 Septembrie", 74311, Bucharest 5, Romania
fax: +(40 1) 411 39 16
e-mail: [email protected]

Abstract

Lexical acquisition is one of the most difficult problems in building operational natural language processing systems. Automatic learning of new words (their morphological, syntactic and semantic properties) is even harder. This paper discusses our solution for overcoming lexical gaps. Learning what an unknown word might mean is abductively driven by world knowledge and by the local context of the unknown word, according to the pragmatic principle stating that abductive inference is inference to the best explanation. A generalization of the weight-assignment scheme of the weighted abduction inference strategy is proposed in order to overcome some difficulties inherent in standard cost-based abductive processing.

1. Introduction

Recent years have seen great interest in abductive inference, as a complement to the older deductive approach, applied in the area of natural language processing [Hobbs,1986, Hobbs,1990a, Stickel,1989, Appelt,1990, Konolige,1991, Tufiş,1992]. It has rightfully been claimed that the abductive approach yields not only a simplification but also a significant extension of the range of phenomena that can be captured: reference resolution, compound nominal interpretation, resolution of syntactic ambiguity, understanding metonymy, etc. Hobbs has shown [Hobbs,1990a] how "interpretation as abduction" can be naturally combined with the earlier view of "parsing as deduction" [Pereira,1983] to produce an elegant and thorough integration of syntax, semantics and pragmatics, accommodating both interpretation and generation. Translation also fits nicely within this abductive framework, being regarded as a matter of interpreting in the source language and generating in the target language [Hobbs,1990b].

In the following we show the use of abduction in dealing with one of the most sensitive problems in natural language processing: overcoming lexical gaps, that is, learning from context the meaning of unknown words. Modern parsers can predict the syntactic category of an unknown word without much difficulty. Some of them may even extend this ability to offer a complete feature-based syntactic description of the unknown word. But few of them offer a uniform approach to predicting the contextual meaning of a word missing from the system's lexicon. Weighted abduction provides a natural environment for solving this problem. Informally, contextually understanding an unknown word UW may be stated as:

1. identify a known word NW which makes it possible for the sentence to be interpreted;
2. assume that, in the given context, UW and NW are interchangeable, that is, they are (at least partially) synonymous.

To interpret a sentence means:

1. proving the logical form of the sentence, together with the constraints that predicates impose on their arguments, allowing for coercion;
2. merging redundancies where possible;
3. making assumptions where necessary.
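This prove-or-assume regime can be sketched in a few lines. The Python fragment below is only a schematic illustration with invented names; the actual engine used in the rest of the paper is Stickel's weighted abductive prover.

    # Minimal, self-contained sketch of the "prove, merge, assume" regime.
    # All names are illustrative; this is not the paper's implementation.
    def interpret(goals, facts, assumption_cost, budget):
        """Prove each goal literal from `facts` when possible, otherwise
        assume it; identical assumptions are merged by the set, and the
        interpretation is accepted only if its total cost fits the budget."""
        proved, assumed = [], set()
        for g in goals:
            if g in facts:
                proved.append(g)          # step 1: provable from what is known
            else:
                assumed.add(g)            # step 3: assumable, at a cost
        total = sum(assumption_cost.get(g, 0.0) for g in assumed)  # step 2: merged
        return (proved, assumed) if total <= budget else None

    # Toy run: one conjunct is known, the other must be assumed.
    facts = {"car(c1)"}
    goals = ["car(c1)", "move1(e1,c1)"]
    print(interpret(goals, facts, {"move1(e1,c1)": 8.0}, budget=10.0))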
2. Weighted abduction

Let Σ be a consistent theory of a domain, expressed as a set of sentences in a given first-order language L. Let Ψ be a set of facts (expressed in the same language L) about the domain modeled by the theory Σ, and let φ be a set of explanations, according to Σ, for the facts in Ψ. Let us further write |— for the inferential operation of Σ, connecting a set of facts in Ψ to a set of explanations in φ.

Now consider F, a set of newly observed facts about the domain. If there exists a set E ⊆ φ such that E |— F, then E is said to represent a deductive set of explanations for F. If no such E is found, but there exists a set A of sentences called assumptions such that:

(1) A ∩ φ = { }
(2) A is admissible for the theory Σ (Footnote 1)
(3) ∃ E' ⊆ φ such that A ∪ E' |— F

then A ∪ E' is said to represent an abductive set of explanations for F. Abductive reasoning thus requires augmenting the set of known facts with a set of facts assumed to be true; this set must not contradict what is already known.

Footnote 1: The notion of admissibility as used here is very close to the one in [Appelt,1990]. For an algorithm checking the admissibility of an assumption set with respect to a first-order theory, see [Appelt,1990].

Obviously, in trying to abductively explain some observed facts, one may use different sets of assumptions. The problem for an abductive inference engine is to provide the "best" set of assumptions. The weighted abductive prover underlying the work reported here was developed by Mark Stickel [Stickel,1988], and its control regime [Appelt,1990] is implemented as an alternative to statistical methods. The abduction is carried out on the basis of model preferences, according to which making an assumption is viewed as restricting the models of the background knowledge. The preference ordering is supported by the use of annotations on the rules encoding the underlying theory. These annotations are expressed as numeric weights. The "bestness" of an assumption set among competing ones may be defined in many different ways, of which the cost criterion is a tempting solution. In weighted abduction, the competing assumption sets are preferentially ordered by increasing cost, cheaper sets being preferred.

During the abductive inferential process, the costs of the different competing assumptions are computed on the basis of the weights annotating the applicable rules. An annotated inference rule is represented as:

    P1^w1 & P2^w2 & ... & Pk^wk ⊃ Q

and has to be interpreted, from the cost point of view, as follows: "if the cost of assuming Q is C, then the cost of assuming Pi is wi · C". A term in the antecedent of a rule appearing without a weight accounts for a predicate which cannot be assumed but must be proved; in proving it, some finer-grained pieces of knowledge necessary for proof completion may still be assumed, if not stated otherwise.

The weight assignment over the terms in a rule is very important for abduction control. If w1 + w2 + ... + wk > 1, then assuming Q is cheaper than assuming P1, P2, ..., Pk, and thus least-specific abduction is favoured. If w1 + w2 + ... + wk < 1, then assuming P1, P2, ..., Pk is cheaper than assuming Q, and most-specific abduction is encouraged. On the other hand, even if w1 + w2 + ... + wk > 1 but some of the Pi's have already been derived, the cost of assuming the remaining terms of the antecedent might be lower than the cost of assuming the consequent.
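To make this cost semantics concrete, here is a minimal Python sketch of the stated interpretation. It is illustrative only: Stickel's prover performs full backward chaining rather than this flat computation.

    # Cost semantics of a weighted rule P1^w1 & ... & Pk^wk ⊃ Q.
    def antecedent_cost(weights, goal_cost, proved):
        """Cost of establishing the antecedent when assuming Q costs
        `goal_cost`: already-proved terms are free, the rest cost wi * C."""
        return sum(w * goal_cost for w, p in zip(weights, proved) if not p)

    C = 10.0                      # cost of assuming Q directly
    weights = [0.6, 0.6]          # w1 + w2 > 1: least-specific abduction favoured
    print(antecedent_cost(weights, C, proved=[False, False]))  # 12.0 > 10.0
    # If P1 has already been derived, assuming only P2 beats assuming Q:
    print(antecedent_cost(weights, C, proved=[True, False]))   # 6.0 < 10.0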
Theorem proving techniques include, among other operations, the factorization of the logical expressions generated during a proof. This is necessary not only for efficiency reasons but also for the completeness of the demonstration procedure [Tufiş,1981, Shapiro,1979]. With weighted abduction, there is another benefit of factorization: allowing for the factoring of logical expressions, weighted abduction provides an elegant way of exploiting the natural redundancy of texts, overriding least-specific abduction.

Consider that one is supposed to derive Q1 & Q2 with an equal cost, say $10, for each conjunct. Assuming Q1 & Q2 would therefore cost $20. But suppose there are two inference rules, stating:

    P1^0.6 & P2^0.6 ⊃ Q1
    P2^0.6 & P3^0.6 ⊃ Q2

Although each rule favors least-specific abduction, factoring over P2 brings the cost of assuming the antecedents down to $18 (P1, P2 and P3 at $6 each, P2 being counted only once), which is $2 cheaper than assuming Q1 & Q2 directly.

3. Overcoming the Lexical Gaps

As one would expect, the interpretation of a sentence containing unknown words leads to a set of assumptions, some of them referring to the meaning of the unknowns. To exemplify our approach to understanding unknown words by abduction, let us consider the simple example (Footnote 2):

Ex1) The car veered right.

containing the word "veered", unknown to the system. Now suppose that the knowledge base of the system contains (among others) the following rules:

    (1) mobile(x)^1.0 & move1(e,x)^1.0 ⊃ constraint(right1, e)
    (2) car(x)^1.0 ⊃ mobile(x)

Footnote 2: In the following, for the sake of simplicity, we leave out some details not relevant here, such as tense, agreement, etc.

The formula "constraint(p,e)" stands for the restrictions that the predicate "p" places on its eventuality argument "e". This predicate belongs to the same class of coercion predicates as "Req" and "rel" (see [Hobbs,1990a]). Unlike "Req" and "rel", which restrict the type of the participants in an event, "constraint" restricts the type of the event itself. These coercion predicates seem to be second-order, but as shown in [Hobbs,1985,1990a] they are easily expressed equivalently as first-order predicates. The axioms above read as follows:

(1') One way of satisfying the requirements imposed by the adverbial reading of the word right (right1) on its argument "e" is to hypothesize that e is the event described by a move action carried out by a mobile agent.
(2') One way to prove that something is a mobile thing is to prove it is a car.

Let us further consider that there exists (among other similar rules) the following rule:

    (3) unk(i,j,w1)^0.5 & w1(e,x)^0.1 & cat(w2,verb1)^0.1 & w2(e,x)^0.1 & ass-syn(e,w1,w2)^0.5 ⊃ verb1(i,j,w1)

The reading of this particular axiom is:

(3') One way of proving the existence of a verb w1 between the interword positions i and j is to prove that it is an unknown word (unk(i,j,w1)) referring to an eventuality e, the agent of which is some x (w1(e,x)), and that there exists a word w2 of the required category, here an intransitive verb (cat(w2,verb1)), such that w2 could refer to the event e implying the agent x (w2(e,x)). In the context of "e", w2 is a synonym for w1 (ass-syn(e,w1,w2)).

The last atomic formula in the antecedent of (3) will never be provable, but it is always assumable.
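For concreteness, such annotated axioms can be represented as weighted Horn clauses. The sketch below shows one possible encoding of rule (3); the representation, including the flattening of the second-order applications w1(e,x) into a first-order "holds" predicate (a move the paper notes is possible, following Hobbs), is our assumption and not the prover's actual notation.

    # One possible encoding of a weighted Horn clause such as rule (3).
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Literal:
        pred: str                       # predicate name, e.g. "unk", "ass-syn"
        args: tuple                     # argument terms (variables or constants)
        weight: Optional[float] = None  # None: must be proved, never assumed

    @dataclass
    class Rule:
        antecedent: list                # list of weighted Literals
        consequent: Literal

    rule3 = Rule(
        antecedent=[
            Literal("unk",     ("i", "j", "w1"),  0.5),
            Literal("holds",   ("w1", "e", "x"),  0.1),   # w1(e,x), flattened
            Literal("cat",     ("w2", "verb1"),   0.1),
            Literal("holds",   ("w2", "e", "x"),  0.1),   # w2(e,x), flattened
            Literal("ass-syn", ("e", "w1", "w2"), 0.5),   # always assumable
        ],
        consequent=Literal("verb1", ("i", "j", "w1")),
    )
    print(rule3.consequent.pred)  # "verb1"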
To show how the sentence in example Ex1) is interpreted, let us consider the grammar rules below, augmented with semantic and pragmatic information (Footnote 3):

    (4) np(i,j,x)^0.4 & vp(j,k,e,x)^0.8 ⊃ s(i,k,e)
    (5) det(i,j,w1)^0.2 & n(j,k,w2)^1.0 & w2(x)^0.05 ⊃ np(i,k,x)
    (6) verb1(i,j,w1)^0.8 & w1(e,x)^0.05 & adverb(j,k,w2)^0.05 & w2(e)^0.05 & constraint(w2,e)^0.1 ⊃ vp(i,k,e,x)

Footnote 3: All the variables in these formulae are implicitly universally quantified.

Stated in English, the rules above say that:

(4') To prove that a string of words between the positions i and k represents a sentence referring to an event e, one has to prove an np from i to j, the referent of which is x, and a vp from j to k denoting an event in which x is the agent.
(5') To prove an np between the positions i and k, the referent of which is x, one has to prove the existence of a determiner (from i to j) in front of a noun the referent of which is x.
(6') To prove a vp between the positions i and k, denoting an event the agent of which is x, one has to prove an intransitive verb (verb1) between the positions i and j which could identify an event the agent of which is x; adjacent to the verb there must be an adverb which could modify the event e. The event e must satisfy the constraints imposed by the adverbial modifier.

The weight assignment is not our main concern here, but as we will see in the next section, annotating the constituents of the phrase-structure rules nicely allows for dealing with extragrammatical input. The next section will also discuss a cost-based treatment of noun-phrase referent identification. Here it suffices to say that the weight assignment is based on empirical evidence (for instance, in rule 4 the weight on the "vp" is higher than that on the "np", since presumably a verb phrase provides more evidence for the existence of a sentence than a noun phrase does).

To assign an interpretation to Ex1) means to prove the following statement (with, let us say, a $100 assumption budget):

    ∃ e s(0,4,e)

Because of the weight assignment in rule 4 (0.4 + 0.8 > 1), we cannot afford to assume the existence of both the np and the vp; we need to be able to prove, at least partially, one of them:

    np(0,j,x)^0.4 & vp(j,4,e,x)^0.8

The proof of the first term can be done, according to rule 5, (almost) for nothing: det(0,1,the), from the input string, matches det(0,j,w1)^0.2, and n(1,2,car), from the input string, matches n(1,k,w2)^1.0. To complete the "np" it remains to prove or assume car(x), depending on whether or not an object denoted by the word car, say car00017, is known in the knowledge base. If it is not known, its existence is assumed for only $2 (0.4 × 0.05 × $100), as part of the new information carried by the sentence.
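All the dollar figures in this walkthrough follow a single mechanism: the assumption cost of a subgoal is the product of the rule weights along its derivation path, multiplied by the top-level budget. The small sketch below, with illustrative names, reproduces some of the figures used here.

    # Weight propagation along a derivation path: the assumption cost of a
    # subgoal is the product of the weights on its path times the budget.
    from functools import reduce
    from operator import mul

    def assumption_cost(budget, path_weights):
        return budget * reduce(mul, path_weights, 1.0)

    BUDGET = 100.0
    # car(x) under rule 5 inside rule 4: 0.4 (np) * 0.05 (w2(x)) * $100 = $2
    print(assumption_cost(BUDGET, [0.4, 0.05]))      # 2.0
    # verb1 subgoal under rule 6 inside rule 4: 0.8 (vp) * 0.8 (verb1) = 0.64
    print(assumption_cost(BUDGET, [0.8, 0.8]))       # 64.0
    # ass-syn under rule 3: 0.8 * 0.8 * 0.5 = 0.32, i.e. $32 if assumed
    print(assumption_cost(BUDGET, [0.8, 0.8, 0.5]))  # 32.0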
To prove vp(2,4,e,car00017)^0.8, rule 6 is relevant; when it is fired, the new goal to prove becomes:

    verb1(2,j,w1)^0.64 & w1(e,car00017)^0.04 & adverb(j,4,w2)^0.04 & w2(e)^0.04 & constraint(w2,e)^0.08

The new weights annotating the predicates above result from multiplying the initial weights in rule 6 by the weight of the previous vp goal (0.8). The word "veered" being unknown to the system, it is assigned the interpretation unk(2,3,veered). Since a match between verb1(2,j,w1) and unk(2,3,veered) is not possible, rule 3 will eventually be fired. Now what is to be proved is:

    unk(2,j,w1)^0.32 & w1(e1,x)^0.064 & cat(w3,verb1)^0.064 & w3(e',x)^0.064 & ass-syn(e1,w1,w3)^0.32 & w1(e,car00017)^0.04 & adverb(j,4,w2)^0.04 & w2(e)^0.04 & constraint(w2,e)^0.08

By factoring (see [Hobbs,1990a]), the expression above yields (Footnote 4):

    unk(2,j,w1)^0.32 & w1(e,car00017)^0.064 & cat(w3,verb1)^0.064 & w3(e,car00017)^0.064 & ass-syn(e,w1,w3)^0.32 & adverb(j,4,w2)^0.04 & w2(e)^0.04 & constraint(w2,e)^0.08

Footnote 4: The factor of two unifiable predicates annotated by the weights w1 and w2 will be assigned the weight min(w1, w2).

The input unk(2,3,veered) matches unk(2,j,w1)^0.32, so after variable binding the expression still to be proved becomes:

    veered(e,car00017)^0.064 & cat(w3,verb1)^0.064 & w3(e,car00017)^0.064 & ass-syn(e,veered,w3)^0.32 & adverb(3,4,w2)^0.04 & w2(e)^0.04 & constraint(w2,e)^0.08

The term adverb(3,4,w2)^0.04 matches the input adverb(3,4,right), so that proving constraint(right,e)^0.08 fires rule 1. One obtains:

    veered(e,car00017)^0.064 & cat(w3,verb1)^0.064 & w3(e,car00017)^0.064 & right(e)^0.04 & ass-syn(e,veered,w3)^0.32 & move1(e,x)^0.08 & mobile(x)^0.08

The predicate mobile(x)^0.08 is easily proved (rule 2), binding x to car00017. Again by factoring, the above expression is reduced to:

    veered(e,car00017)^0.064 & cat(move1,verb1)^0.064 & move1(e,car00017)^0.08 & right(e)^0.04 & ass-syn(e,veered,move1)^0.32

By dictionary check-up, cat(move1,verb1)^0.064 is proved, and finally one gets:

    veered(e,car00017)^0.064 & move1(e,car00017)^0.08 & right(e)^0.04 & ass-syn(e,veered,move1)^0.32

Since nothing further can be proved, these four conjuncts have to be assumed. The system assumes (for $6.4) the word "veered" to be an (intransitive) verb and (for another $32) that it is (partially) synonymous with the verb "to move" (veering is a kind of moving; Footnote 5). Under these assumptions, the new information carried by the sentence is that a specific car participated in an event of type move1 (which could more precisely be referred to as a veering event) and that the direction in which this event progressed was to the right.

Footnote 5: This approach to partial synonymy could be elegantly expressed by the axiom (∀x,e) ass-syn(e,w1,w2) & w1(e,x) & etc(x) ⊃ w2(e,x), where "etc(x)" is a predicate standing for whatever unspecified properties of x. It will never be provable, but it is always assumable. The predicate "etc" is a logical device analogous to the abnormality predicate in circumscription logic [McCarthy,1987].

4. Extragrammaticality and noun-phrase referent identification

We have seen in the previous sections that weights annotate not only semantic and pragmatic information but syntactic information as well. As already suggested, the effect of using weights on the different syntactic constituents is to allow for dealing with syntactically deviant sentences. Thus, if the input matches no rule in the grammar, the matching conditions can be relaxed by making abductive assumptions. Recall, for instance, rule 4 (and its English formulation) used in section 3:

    (4) np(i,j,x)^0.4 & vp(j,k,e,x)^0.8 ⊃ s(i,k,e)

The annotations on the categorial predicates provide an opportunity to accept incomplete sentences or ill-formed constituents. A missing noun phrase can be assumed (according to rule 4) for 40% of the sentence interpretation cost. According to rule 5, a noun phrase has to be made up of a determiner and a noun, so if either of them is missing the noun phrase is ill-formed. Still, the relatively low cost on the determiner allows, in our case, a bare noun to be interpreted as a noun phrase for only 8% (0.4 × 0.2) of the sentence interpretation cost. On the other hand, the failure to identify a noun, even with a determiner found, forces the assumption of the whole noun phrase (the least-specific strategy is here cheaper than the most-specific one). The same discussion holds for the verb phrase too, but there the assumption cost is significantly higher.
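These recovery costs can be compared mechanically. The sketch below takes its weights from rules (4) and (5); the comparison logic itself is only illustrative.

    # Recovery costs for ill-formed noun phrases under rules (4) and (5).
    S = 1.0                                 # sentence interpretation cost, normalized
    NP_W, DET_W, N_W, REF_W = 0.4, 0.2, 1.0, 0.05

    # Bare noun ("car ..."): assume only the missing determiner.
    missing_det = NP_W * DET_W * S                          # 0.08 -> 8% of S
    # Determiner without a noun ("the ..."): most-specific recovery would
    # assume the noun and its referent; least-specific assumes the whole np.
    missing_noun_most_specific  = NP_W * (N_W + REF_W) * S  # 0.42
    missing_noun_least_specific = NP_W * S                  # 0.40, cheaper
    print(missing_det, missing_noun_most_specific, missing_noun_least_specific)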
In the following we present an extension to the abductive engine described so far; the necessity of this extension will soon become apparent.

Let us focus our attention on the interpretation of the determiner with respect to identifying the referent of a noun phrase. Most NL designers take a definite noun phrase to refer to an entity already known in the context, while an indefinite noun phrase is taken to introduce a new object into the context. Let us consider the rule:

    (7) det(i,j,w1)^q1 & noun(j,k,w2)^q2 & w2(x)^q3 ⊃ np(i,k,x)

In this rule, the predicate "w2(x)" accounts for identifying the referent of the string "w1 w2", proved (or assumed) as the structure of the noun group. Obviously, the expectation of proving w2(x) heavily depends on the actual string making up the noun group. If we adopt the convention above, w2(x) should be provable when w1 is a definite determiner, and assumable when w1 is an indefinite determiner. To account for this distinction, a solution might be to split rule 7 into two more specific rules:

    (8) def-det(i,j,w1)^q'1 & noun(j,k,w2)^q'2 & w2(x)^q'3 ⊃ np(i,k,x)
    (9) indef-det(i,j,w1)^q"1 & noun(j,k,w2)^q"2 & w2(x)^q"3 ⊃ np(i,k,x)

In order to follow the above convention, in rule 8 the assumption cost for the referent of the noun group should be very high, while in rule 9 it should be very low.

There are some problems with this approach. For instance, suppose that the assumption budget for finding an "np" is C. Let us further suppose a state of the world, STATE1, in which an object "car00017" is known, so that "car(car00017)" is true, and another one, STATE2, in which there is no such object. Now let us examine what happens when the system receives the following strings:

Ex2) ... the car ...
Ex3) ... a car ...
Ex4) ... car ...

In STATE1, the interpretations of the noun groups in Ex2)-Ex4) will cost:

    cost-ex2-STATE1 = 0 (according to rule 8)
    cost-ex3-STATE1 = 0 (according to rule 9)
    cost-ex4-STATE1 = min(q'1 · C, q"1 · C)

In STATE2, the interpretations of the same substrings will yield the following costs:

    cost-ex2-STATE2 = q'3 · C (according to rule 8)
    cost-ex3-STATE2 = q"3 · C (according to rule 9)
    cost-ex4-STATE2 = min((q'1 + q'3) · C, (q"1 + q"3) · C)

Since there is no serious argument for assigning different assumption costs to the two types of determiners, presumably q'1 ≅ q"1 = q1. According to the previous discussion, q'3 >> q"3. In this case,

    cost-ex4-STATE1 = q1 · C (according to either rule 8 or rule 9)
    cost-ex4-STATE2 = (q"1 + q"3) · C (according to rule 9)

Let us notice that although q"1 + q"3 < q'3, it is not possible to assume "the" to be an "indef-det" in order to apply rule 9 instead of rule 8 when interpreting "the car" in STATE2; such an assumption would contradict the rest of the known facts, namely that "the" is a definite determiner. The results are summarized below:

    Table 1
    Substring   STATE1 cost    applied rule        STATE2 cost       applied rule
    the car     0              rule 8              C · q'3           rule 8
    a car       0              rule 9              C · q"3           rule 9
    car         C · q1         rule 8 or rule 9    C · (q"1 + q"3)   rule 9
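The figures in Table 1 can be reproduced mechanically. In the sketch below, the concrete values chosen for q1, q'3 and q"3 are our assumptions; the paper fixes only their relative order (q'1 ≅ q"1 and q'3 >> q"3).

    # Reproducing Table 1: np interpretation costs under rules (8) and (9).
    C  = 100.0
    q1 = 0.2                        # equal weight on def-det and indef-det
    q3_def, q3_indef = 0.9, 0.05    # q'3 >> q"3, per the stated convention

    def np_cost(det_present, det_definite, referent_known):
        det = 0.0 if det_present else q1 * C      # assume a missing determiner
        q3  = q3_def if det_definite else q3_indef
        ref = 0.0 if referent_known else q3 * C   # assume a missing referent
        return det + ref

    for known in (True, False):                   # STATE1, then STATE2
        print("the car:", np_cost(True, True,  known),
              "a car:",   np_cost(True, False, known),
              "car:",     min(np_cost(False, d, known) for d in (True, False)))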
There are some queer results here. First, it seems odd that, irrespective of the definiteness of the noun group, its referent when interpreted in STATE1 is the same ("car00017"). This association is reflected in the table above by the null cost on the noun phrases "the car" and "a car". Second, it might be of interest to be able to identify precisely which rule should have been applied when an assumption was made to overcome a certain grammatical deviance (Footnote 6).

Footnote 6: This might be useful, for instance, in a CALL system equipped with an explanatory module dealing with ungrammatical input.

In our examples (the STATE1 case), because of the presumably equal weights on the definite and indefinite determiner predicates, the incomplete noun phrase "car", although recovered (correctly or not), has lost the information of whether it was meant to be definite or indefinite. A local solution to this drawback would be to assign a slightly lower weight to "def-det" than to "indef-det", so that rule 8 is chosen. This patch would indeed favor rule 8, but the problem of indiscriminate referent resolution remains.

Now suppose we decide that whenever an indefinite noun phrase is interpreted, it always introduces into the context a new object of the appropriate type, irrespective of whether an object of the type in question already exists. The use of the type predicate w2(x)^q is not possible in this case: if an appropriate object existed in the context, x would be bound to it. To handle this problem, we introduce the always-assumable predicate Ass(p,x), which asserts the truth of the predicate p(x), x being an arbitrary new object.

A further enhancement of the abduction mechanism is to generalize weights even beyond what Stickel has already done. Stickel (1989) generalized the weights assigned to the terms in the antecedent of an axiom to arbitrary functions of the assumability cost of the consequent of the axiom. While considerably more powerful, even this generalization does not seem to be enough. What we would like is to annotate the terms in the antecedent of a rule with what might be called assumption-cost-track functions (ACTFs). An ACTF specifies the cost of assuming a predicate not just as a percentage, or even a function, of a given figure, but as a relation to the costs of other assumptions already made. Consider the new rule, replacing rule 9:

    (10) indef-det(i,j,w1)^q"1 & noun(j,k,w2)^q"2 & w2(x)^q"3 & Ass(w2,y)^q4 ⊃ np(i,k,y)

    where q4 = ACTF(w2(x)) = 0 if w2(x) was assumed, and q4 = q"4 otherwise.

The results of introducing rule 10 into the grammar are shown in the table below:

    Table 2
    Substring   STATE1 cost    applied rule   STATE2 cost       applied rule
    the car     0              rule 8         C · q'3           rule 8
    a car       0              rule 10        C · q"3           rule 10
    car         C · q1         rule 8         C · (q"1 + q"3)   rule 10

As one can see, the net result is that in the case of an indefinite noun phrase, its referent (the variable y) is bound to a newly created object of the appropriate type. Also, the preferred reading of an incomplete noun phrase (as a definite noun phrase if there exists an appropriate referent, or as an indefinite noun phrase otherwise) is obtained in a more rigorous manner.
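One way to realize such an ACTF is to let the weight of a term consult the record of assumptions already made in the current proof. The sketch below is an assumed realization of the q4 annotation of rule (10), not the paper's implementation.

    # A sketch of an assumption-cost-track function (ACTF): the weight of the
    # Ass(w2, y) term in rule (10) depends on whether w2(x) was itself assumed.
    def actf_q4(assumed_so_far, referent_literal, q4_default=0.1):
        """Return 0 if the referent literal was already assumed in this proof
        (the new object has in effect been paid for), q4_default otherwise."""
        return 0.0 if referent_literal in assumed_so_far else q4_default

    C = 100.0
    # STATE2-like case: car(x) was assumed, so Ass(car, y) adds no further cost.
    print(actf_q4({"car(x)"}, "car(x)") * C)   # 0.0
    # STATE1-like case: car(x) was proved, so introducing the new object y
    # carries the cost q"4 * C (the value 0.1 for q"4 is assumed here).
    print(actf_q4(set(), "car(x)") * C)        # 10.0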
5. Conclusions

We have shown how the difficult problem of overcoming lexical gaps in natural language understanding can be solved within an abductive environment. Understanding unknown words is modeled in terms of the partial synonymy relation. The contextual meaning of a newly learnt word is imported from concepts which can be proved to satisfy all the restrictions the context imposes on it, the candidate synonyms being ranked according to their merits in providing the best explanation for why the input string should be a meaningful sentence. The question of the role the determiner plays in the referent identification problem (in the presence of grammatical deviance) is uniformly solved by virtue of the same basic principle: abductive inference is inference to the best explanation.

Acknowledgements

Research reported here was supported in part by a grant from the International Research & Exchanges Board (IREX), with funds provided by the Andrew W. Mellon Foundation, the National Endowment for the Humanities, the Association for Computational Linguistics and the U.S. Department of State. None of these organizations is responsible for the views expressed. I would like to warmly thank the people at SRI working on the TACITUS project. A special mention is due to Jerry Hobbs; without his kindness, patience and encouragement, this paper would never have been written.

REFERENCES

Appelt Douglas E., Pollack Martha E.: Weighted Abduction for Plan Ascription. SRI International, Technical Note 491, May 1990.

Goodman Bradley A.: Reference Identification and Reference Identification Failures. Journal of Computational Linguistics, vol. 12, no. 4, pp. 273-305, 1986.

Hobbs Jerry R.: Overview of the TACITUS Project. Journal of Computational Linguistics, vol. 12, no. 3, 1986.

Hobbs Jerry R., Stickel Mark, Appelt Douglas, Martin Paul: Interpretation as Abduction. SRI International, Technical Note 499, December 1990.

Hobbs Jerry R., Kameyama Megumi: Translation by Abduction. SRI International, Technical Note 484, May 1990.

Konolige Kurt: Abduction vs Closure in Causal Theory. SRI International, Technical Note 505, April 1991.

McCarthy John: Circumscription: A Form of Nonmonotonic Reasoning. In M. Ginsberg (ed.), Readings in Nonmonotonic Reasoning, pp. 145-152, Morgan Kaufmann Publishers, Los Altos, California, 1987.

Pereira Fernando C.N., Warren David: Parsing as Deduction. Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, pp. 137-144, 1983.

Shapiro Stuart C.: Techniques of Artificial Intelligence. D. Van Nostrand Company, 1979.

Stickel Mark: A Prolog-Technology Theorem Prover: Implementation by an Extended Prolog Compiler. Journal of Automated Reasoning, no. 4, pp. 353-380, 1988.

Stickel Mark: Rationale and Methods for Abductive Reasoning in Natural Language Interpretation. Lecture Notes in Artificial Intelligence, no. 459, pp. 233-252, Springer-Verlag, Berlin, 1989.

Tufiş Dan I.: Theorem Proving Techniques in Natural Language Question Answering. Proceedings of the 3rd INFO-IASI Symposium, pp. 264-280, 1981 (in Romanian).

Tufiş Dan I.: Abductive Natural Language Processing. Tutorial Notes for the International Summer School "Current Topics in Computational Linguistics", Tzigov Chark, 62 pp., September 1992.