Deriving Part of Speech Probabilities from a Machine-Readable Dictionary

Deborah A. Coughlin
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
[email protected]

Abstract. A method to add part-of-speech statistics, derived from a dictionary, to a rule-based broad-coverage parser is described. Supplying the parser with the most likely part of speech, determined by the number of senses per part of speech in a standard dictionary, improves parsing accuracy. This method improves the parser's accuracy without requiring a large corpus or manual encoding. Part-of-speech statistics derived this way may prove useful for languages that do not have large, well-balanced corpora available.

1 Introduction

To achieve broad coverage, a Natural Language Processing system needs a complete lexicon. This lexicon must include both frequent and seldom-used terms; even rare parts of speech need to be represented when attempting broad coverage. Manually coding the required information is both time-consuming and error-prone. A standard on-line dictionary, though, represents centuries of hand-coding by skilled lexicographers. It is meant for broad coverage in the sense that it is intended as a lexical tool for human users in language tasks, regardless of topic or domain. Machine-readable dictionaries (MRDs) are expanding their user base to include NLP systems because they provide the large and complete lexicon needed for broad coverage. Though dictionaries prove useful as sources of comprehensive lexicons for natural language parsers, their completeness introduces ambiguity that is not easily resolved. The importance of handling part-of-speech ambiguity cannot be overstated. The American Heritage Dictionary (1992 edition) has approximately 18,500 words with multiple parts of speech, which represents around 12% of the total number of entries. (Inflected forms are included.) These part-of-speech ambiguous words are common words found in many genres.
DeRose [5] found that though only 11% of word types in the Brown Corpus were part-of-speech ambiguous, that number represented 48% of the actual tokens, evidencing that words which are part-of-speech ambiguous tend to be common, well-used words. Table 1 provides an example of a sentence that contains many part-of-speech ambiguous tokens. As Church [3] points out, relying solely on an augmented transition network (ATN) to determine the most probable part of speech leads to multiple and unlikely parses. This is most likely true for all broad-coverage rule-based approaches. To accomplish broad coverage, a parser must be able to analyze the variety of structures found in real text. When a single sentence contains multiple words that are ambiguous with respect to their part of speech, determining the most probable parse becomes a difficult undertaking. This problem becomes extreme when truly broad-coverage parsing is attempted.

Table 1. An Example from the Brown Corpus of a Sentence with Many Part-of-Speech Ambiguous Tokens

    The brief notes introducing each work offer salient historical or technical points
    Adj Noun Noun Verb Adv Verb Verb Adj Adj Conj Adj Noun
    Adv Adj Verb Adj Noun Noun Noun Noun Noun Verb Verb Pron Adj

It is computationally imperative that the parser be able to choose the most probable parse from the potentially large number of possible parses. Further processing of the input quickly becomes complex and inefficient if more than one parse must be considered [9]. Adding part-of-speech probabilities to the parsing process would guide the parser to produce the most likely of the possible parses. Over the last twenty-five years, statistical models for part-of-speech tagging of text have made great strides in terms of simplicity of the algorithms used and accuracy of the output [4]. Greene and Rubin [8] needed a large rule database for tagging the Brown Corpus.
Garside and Leech [7], DeRose [4], Church [2], and others have improved the efficiency and accuracy of tagging algorithms while reducing the rule database. Nevertheless, the need for large training corpora remains. Statistical approaches usually require a training corpus that has been manually tagged with part-of-speech information. The Microsoft NLP parser is an empirically-based, broad-coverage, rule-based chart parser. It makes use of augmented phrase structure grammar rules, described in Jensen [9], to provide a syntactic parse that can be exploited later to determine functional roles and long-distance dependencies. Resolving part-of-speech ambiguities is as difficult a task for our parser as it is for other broad-coverage parsers. Statistical information derived from large corpora has proved useful in this regard. Richardson [14] used our rule-based parser to derive part-of-speech and rule probabilities from untagged corpora. He then incorporated those part-of-speech and rule probabilities into our parser, improving its speed and accuracy. This approach assumes the availability of a large corpus and a fairly comprehensive parser. Large, well-balanced corpora like the Brown Corpus [12] and the Lancaster-Oslo/Bergen (LOB) Corpus [10] are available for relatively few languages. Other sources of part-of-speech probabilities would prove useful for languages without these valuable resources. The remainder of this paper explores the use of MRDs as a potential source of part-of-speech probabilities.

2 Dictionary as a Source of Part-of-Speech Probabilities

Zipf [16] found that, on average, the more frequently a word occurs, the more senses it has in the dictionary. Extending Zipf's law a bit further, we postulate that for polysemous words, the more frequent a part of speech, the more senses that part of speech will have. The word follow, for example, has twenty-one verb senses, five phrasal verb senses and two noun senses in the American Heritage Dictionary (1992 edition).
Though we cannot claim that follow is thirteen times more likely to be a verb, the greater number of verb senses coincides with our intuitions about the more likely part of speech. The method is conceptually simple. We begin with a machine-readable version of the American Heritage Dictionary (AHD). The AHD classifies words into eight parts of speech: noun, verb, adjective, adverb, pronoun, preposition, conjunction and interjection. For each dictionary entry, we tabulate the number of senses for each part of speech. For instance, the word school has fourteen noun senses, three verb senses and one adjective sense in the AHD. A part-of-speech probability is then derived based on the number of senses counted. The part of speech with the highest sense count is considered the most likely part of speech. Thus, for school, the noun part of speech would be considered most probable. Though it is incorrect to consider these calculations as the probability of a word occurring with a particular part of speech in real text, we are claiming that for most entries, the larger the ratio of the number of senses for a part of speech to the total number of senses, the more probable that part of speech. Part-of-speech probabilities are also determined for inflected forms. Inflected forms for dictionary entries are determined automatically by rule-based generation of inflectional paradigms and information provided by the AHD. Inflected forms are treated as lexicalized entries for the initial computation. For the word cats, for example, a noun and a verb record are generated. Because cats can be both a plural noun and a present tense, third-person, singular verb, the sense counts from both the noun and verb senses of cat are assigned to cats. For catting, only the verb senses of cat would be considered. There are twelve noun senses and two verb senses for cats. For this word, the Brown Corpus has nine occurrences of cats as a noun and none as a verb.
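The core calculation can be sketched in a few lines of Python; the function names are illustrative, and the sense counts are the AHD figures cited above for school:

```python
# Sketch of the sense-count method: rank parts of speech for an entry by
# the number of dictionary senses each part of speech has. Illustrative
# helper names; not part of any described system.

def pos_ratios(sense_counts):
    """Map each part of speech to its share of the entry's total senses."""
    total = sum(sense_counts.values())
    return {pos: count / total for pos, count in sense_counts.items()}

def most_probable_pos(sense_counts):
    """The part of speech with the highest sense count is considered most likely."""
    return max(sense_counts, key=sense_counts.get)

# "school": fourteen noun senses, three verb senses, one adjective sense.
school = {"Noun": 14, "Verb": 3, "Adj": 1}
print(most_probable_pos(school))  # Noun
print(pos_ratios(school))         # Noun ~0.78, Verb ~0.17, Adj ~0.06
```

As the text cautions, these ratios are not occurrence probabilities; they only rank the parts of speech of an entry by relative likelihood.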
Statistics derived from the Brown Corpus would favor the noun reading even more than this approach does, but the relative ranking of the particular parts of speech remains the same. The distance between the noun and verb calculations can be used to further penalize the unlikely verb part of speech. If an inflected form is already a dictionary entry (i.e., if it has been assigned senses in the dictionary), the sense counts of the inflected form are combined with the counts for the base form. For fell, there are two verb senses, six noun senses and four adjective senses. We add the verb sense counts from fall because fell is also the past tense of fall. Not until the twenty-four verb senses found under fall are added into the calculations do the numbers look reasonable. In the Brown Corpus, fell is a verb 100% of the time. If we use the ratios derived from this method to indicate the likelihood of a particular part of speech, fell would have an 89% probability of being a verb. Though this approach does not "equal" the part-of-speech calculations that a statistical approach would generate, it does provide similar results. Figure 1 describes the approach to tabulating part-of-speech sense counts. Once the sense counts are tabulated, they are given to the parser, in hopes that this information will improve parse quality and reduce the size of the search space required to get a good parse.

1. Generate inflected forms for each dictionary entry.
2. For each entry (inflected forms included) in the MRD:
   - Calculate the number of senses for each part of speech.
   - If the entry is an inflected form, calculate the number of senses for the lexeme of the part of speech of the inflected form. Add the entry and lexeme counts.

Figure 1. Algorithm for Determining Part-of-Speech Sense Counts.

3 The System Using Part-of-Speech Probabilities

Our project makes use of a rule-based chart parser.
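The combination step for inflected forms (step 2 of Figure 1) can be sketched as follows, using a toy lexicon built from the cat/cats/catting example; the data structures are illustrative assumptions, not the system's actual representation:

```python
# Sketch of the Figure 1 combination step: an inflected form inherits the
# lexeme's sense counts for each part of speech the inflection can realize,
# added to any senses the inflected form has as a dictionary entry of its
# own. Toy lexicon; real entries come from the MRD.
from collections import Counter

# Base-form sense counts (AHD figures cited in the text).
LEXICON = {
    "cat": {"Noun": 12, "Verb": 2},
}

# Parts of speech each inflected form can realize, per the text:
# "cats" is a plural noun and a third-person singular verb;
# "catting" is only a verb form.
INFLECTIONS = {
    "cats": ("cat", ["Noun", "Verb"]),
    "catting": ("cat", ["Verb"]),
}

def sense_counts(word, own_senses=None):
    """Combine an entry's own sense counts with its lexeme's counts."""
    counts = Counter(own_senses or {})
    if word in INFLECTIONS:
        base, possible_pos = INFLECTIONS[word]
        for pos in possible_pos:
            counts[pos] += LEXICON[base].get(pos, 0)
    return dict(counts)

print(sense_counts("cats"))     # {'Noun': 12, 'Verb': 2}
print(sense_counts("catting"))  # {'Verb': 2}
```

For a form like fell, which has senses of its own, the `own_senses` argument would carry the inflected entry's counts before the past-tense verb senses of fall are added in.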
In an effort to avoid the inefficient all-paths, breadth-first approach to parsing, we employ a strategy described in Richardson [14]. This approach is similar to the best-first parsing algorithm Allen [1] describes. The Richardson algorithm used in our system makes use of a priority queue. All part-of-speech records¹ for all words in an input string are placed in the queue. For each word in the input string, the part-of-speech record considered most probable is placed at the top of the queue. This assures that for each word in the input, the most probable part-of-speech record is made available to the parser initially. The rest of the queue contains both rule and part-of-speech records, sorted highest probability first. For each word in an input string, the value of the part-of-speech calculation for the part of speech considered most probable is not relevant: that part of speech is given a probability of 1.0 so that it is placed at the top of the queue. For the other part-of-speech records for each entry, the value of the part-of-speech calculation determines how high the record is placed in the queue. Rules are attempted and part-of-speech records are entered into the chart in the order they are found in the queue. Each time a part-of-speech record enters the chart, all applicable rules (based on examination of constituent sequences) are also placed in the queue, positioned according to their associated probabilities. When a parse tree for the entire input string is found, the process ends. Though the set of augmented phrase structure grammar rules that the chart parser relies on is optimized to produce just one parse for each sentence, multiple parses and fitted² parses are also possible. We used the sense ratio information from the dictionary to reduce the search space and produce the most likely parse first. This leads to a reduction in the number of multiple and fitted parses (for grammatical sentences) and makes the process less computationally expensive.
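The queue discipline described above can be sketched with a standard binary heap; the records and probabilities below are simplified stand-ins for the parser's rule and part-of-speech records:

```python
# Sketch of the priority-queue setup: each word's most probable
# part-of-speech record is forced to the top (probability 1.0); the
# remaining records compete on their sense-ratio values. Rule records
# would be pushed the same way as the parse proceeds. Illustrative only.
import heapq
import itertools

def build_queue(words, pos_ratios):
    """pos_ratios: word -> {pos: sense ratio}. Returns records best-first."""
    heap = []
    tiebreak = itertools.count()  # stable order for records with equal priority
    for word in words:
        ranked = sorted(pos_ratios[word].items(), key=lambda kv: -kv[1])
        for rank, (pos, ratio) in enumerate(ranked):
            # Most probable record per word gets probability 1.0 so it is
            # available to the parser immediately; the rest keep their ratios.
            prob = 1.0 if rank == 0 else ratio
            heapq.heappush(heap, (-prob, next(tiebreak), word, pos))
    ordered = []
    while heap:
        neg_prob, _, word, pos = heapq.heappop(heap)
        ordered.append((word, pos, -neg_prob))
    return ordered

ratios = {"school": {"Noun": 14/18, "Verb": 3/18, "Adj": 1/18},
          "follow": {"Verb": 26/28, "Noun": 2/28}}
for word, pos, prob in build_queue(["school", "follow"], ratios):
    print(f"{word:8s}{pos:6s}{prob:.2f}")
```

Popping the heap yields the Noun record for school and the Verb record for follow first (both at 1.0), followed by the remaining records in descending ratio order.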
Richardson [14] provides a complete description of this queuing process. He also describes the process that generated the rule probabilities and normalized the part-of-speech probabilities to the rule probabilities. The only difference between his system and the one being described here is the source of the part-of-speech data. For his system, the Brown Corpus was used to compute part-of-speech probabilities; in our approach, a dictionary was used.

4 Results

The AHD in machine-readable form was scanned using the algorithm described above to obtain the sense counts for each entry. The process took less than thirty minutes on a Pentium/100 PC. The resulting values were made available to the parser by a probabilistic algorithm described in Richardson [14]. The parser was applied to three sets of test sentences to determine whether this approach was helpful in improving parsing accuracy and efficiency. The tests were run using no probabilities and using the part-of-speech probabilities derived from the MRD. The first test was a list of 2800 linguistic textbook example sentences. This ensured that there was a wide variety of linguistically complex structures in the test file. The average sentence length was 10 words. Of the 179 sentences that parsed differently when dictionary-derived probabilities were added, 64% of the parses generated using these probabilities were better, while 23% represented regressions. Fitted parses were reduced by 45% using these part-of-speech probabilities. There were 114 fitted parses for the 2800 sentences when no probabilities were used.

¹ Part-of-speech records are similar to edges in traditional chart parsing. These part-of-speech records represent the union of the syntactic and morphological information contained in the senses for that part of speech. There is only one record per part of speech for a given word.
² Fitted parses are assigned to strings which cannot be analyzed as fully well-formed sentences according to the grammar embodied in the parser.
That number was reduced to 63 using these part-of-speech probabilities. These test results are in Table 2.

Table 2. Test file 1: Nature of Changes in Parsing When Dictionary-Derived Part-of-Speech Probabilities Were Added to Parser.

    Type of Parse Change When Probabilities Added
    Better Parse           64%
    Worse Parse            23%
    Other³                 13%
    Fewer Fitted Parses    45%

The same test was run using the Brown Corpus derived part-of-speech probabilities generated by Richardson [14]. The parser did better using Richardson's probabilities than it did using dictionary-derived probabilities. There were 170 different parses when corpus-based probabilities were added. Of those 170 differences, 74% represented improvements. The number of fitted parses decreased by 47%. It is not surprising that the results using Richardson's corpus-based probabilities were better than those using dictionary-based probabilities. The parser has been using the part-of-speech probabilities generated by Richardson for over two years, and those two years of optimizing the system with those part-of-speech probabilities give it an advantage. Richardson's probabilities include rule probabilities; these same rule probabilities were used for both approaches. The second set of test sentences consisted of 523 sentences randomly selected from the Brown Corpus. The average length of these sentences was 21.5 words. Because of the random nature of the selection method, all types of sentences, including ones containing mathematical formulas, were selected. No hand editing was done. In this test, 117 differences were found when comparing the parses using no probabilities to parses generated using the dictionary-based probabilities. Of those 117, 37% were improvements, while 14% were regressions. The remaining 49% represented parse differences that could not be considered qualitatively different. The same test was applied to the parser using Richardson's part-of-speech probabilities. In this case the results are similar to the dictionary-based results.
Of the 121 differences, 35% represented improvements, while 15% represented regressions. Again, 49% of the differences were found to be neither improvements nor regressions. Both the corpus-based and the dictionary-based results are presented in Table 3.

Table 3. Test file 2: When Compared to No Probabilities, Percentage of Change that Represented Improvements, Regressions and No Qualitative Difference for the Two Approaches.

                   Dictionary-Based Probabilities   Corpus-Based Probabilities
    Better Parse   37%                              35%
    Worse Parse    14%                              15%
    Other          49%                              49%

The third test consisted of a selection of both linguistic example sentences and text from a variety of real world sources: The Wall Street Journal, Time Magazine, etc. The average length of these sentences was 17 words. We used this test to focus on the number of multiple parses generated. We assume that a reduction in the number of multiple parses indicates an improvement in the parser's efficiency and accuracy. The quality of these parses was not examined, but the first two test results give us reason to believe that this is true. The parser was set up to allow multiple parses for this test. For the 600 sentence file, the number of sentences that received a non-fitted parse was compared to the total number of parses generated; these results are summarized in Table 4.

Table 4. Test file 3: Average Number of Parses per Sentence Successfully Parsed for a 600 Sentence Test File.

                                     No. of Sentences   Total Number   Average Number
                                     with a Parse       of Parses      of Parses
    No Probabilities                 529                1059           2.0
    Dictionary-Based Probabilities   536                971            1.8
    Corpus-Based Probabilities       546                925            1.7

The dictionary-based approach reduced the total number of multiple parses by 88 parses for this 600 sentence test file. When corpus-based probabilities were used with the parser, the results were even better; 134 fewer multiple parses were generated.

³ The category Other indicates parses that were not qualitatively different.
The dictionary-based approach to deriving part-of-speech probabilities needs further refinements, but these results suggest there is merit in further study. The interaction between the parser, rule probabilities and part-of-speech probabilities needs to be explored to further reduce the search space the parser must explore to produce a good parse.

5 Discussion

The use of a dictionary as the sole source of part-of-speech information has been criticized in the literature. Church [2] makes the point that dictionaries are designed to inform the user of the possible, without indicating what is most probable. Both the primarily unambiguous see and the truly ambiguous saw have noun and verb senses [2]. Without relying on some sort of probabilities, both the noun part of speech and the verb part of speech would receive equal consideration. This approach shows that the dictionary does have information to indicate the likelihood of a part of speech. For saw, 28% of the senses are noun senses, but for see, only 4% of the senses are noun senses. Using the dictionary as the sole source of part-of-speech information does not mean frequent and rare parts of speech must be considered equally. This approach assumes the availability of an MRD. For some languages, comprehensive dictionaries are not available. Statistical approaches have a similar problem: they require a large, balanced corpus and, under most approaches, a tagged training corpus. Having a viable source beyond large corpora for determining part-of-speech probabilities increases the chances that part-of-speech information can be obtained for a particular language. In using a standard, all-purpose dictionary, our system is optimized for general purpose text, but for some tasks, being able to tailor a parser to a specific domain may be helpful. The dictionary has a wealth of domain-specific information hidden in its senses. The calculations used here can be biased to favor parts of speech that occur in a specific domain.
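One way such a bias could be realized: if senses carried domain labels, counts for senses matching a target domain could be weighted upward. The sense list, labels, and boost factor below are illustrative assumptions, not features of the described system:

```python
# Sketch of domain biasing: senses carrying a matching domain label
# contribute extra weight to their part of speech. The twelve noun and
# two verb senses of "cat" match the counts cited in this paper; the
# "Nautical" label on one verb sense and the boost factor are hypothetical.
from collections import Counter

# Each sense is (part of speech, domain label or None).
CAT_SENSES = [("Noun", None)] * 12 + [("Verb", None), ("Verb", "Nautical")]

def biased_counts(senses, domain=None, boost=15):
    """Count senses per part of speech, up-weighting senses in `domain`."""
    counts = Counter()
    for pos, label in senses:
        counts[pos] += boost if domain is not None and label == domain else 1
    return counts

print(biased_counts(CAT_SENSES))                     # Noun 12, Verb 2
print(biased_counts(CAT_SENSES, domain="Nautical"))  # Noun 12, Verb 16
```

With no domain set, the noun reading dominates as before; with a nautical bias, the boosted verb sense overtakes it, which is the effect described in the following discussion of cat.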
As noted earlier, cat has twelve noun senses and two verb senses, so the verb part of speech would normally be placed low in the queue. If tailoring this system to a nautical domain, the verb sense of cat, to hoist an anchor to (the cathead),⁴ could trigger the system to assign a higher count to that part of speech, causing the verb sense to be favored. Another possibility might be to use domain-specific dictionaries in combination with a comprehensive dictionary to determine the sense ratios. The sparse data problem encountered in statistical approaches persists to some degree in this approach. The greater the number of senses for a particular entry, the closer these results align with the part-of-speech probabilities generated by Richardson [14] and actual occurrence in the Brown Corpus. When the number of senses is low, the results are less reliable. To some extent, the division of word meanings into distinct dictionary senses depends on arbitrary lexicographic decisions [11]. For verbs, for example, it depends on the lexicographic traditions of a particular dictionary whether a meaning that can be both transitive and intransitive is represented as one sense or two. Longman's Dictionary of Contemporary English (1978 edition) uses one sense; the AHD uses two. The possibility that the existence of multiple transitivity states for a verb should affect frequency predictions is unexplored. One possible objection to this approach is the time and expense of acquiring an MRD. However, the MRD is an essential tool in further semantic work. To go beyond the useful but limited information provided by a syntactic parse, a rich knowledge base is necessary. Dolan et al. [6] and Vanderwende [15] have made use of automated strategies to exploit the rich source of lexical information found in on-line dictionaries to create a highly-structured lexical knowledge base.
Because the MRD is a necessary development tool, the cost of generating these part-of-speech probabilities is negligible. Well-balanced corpora take time and resources to develop. The approach described in this paper can provide a set of part-of-speech probabilities while a parser is in the development stage, before the parser is mature enough to parse real text. Currently, it has only been tested using a mature English parser. We hope to test this approach on versions of the same system for other languages that are in earlier stages of development. From those tests, we will find out whether this approach is useful in the development of a rule-based parser. Richardson's [14] approach for supplying part-of-speech probabilities to the parser requires a fairly mature parser to get well-balanced part-of-speech probabilities.

6 Conclusion

We have described a unique approach to determining part-of-speech probabilities to use with a rule-based broad-coverage parser. The results show that by calculating the number of senses per part of speech in a comprehensive MRD to determine the most probable part of speech and then supplying that information to the rule-based parser, parses improve and fewer multiple parses are generated. A notable feature of this approach is that it does not require a mature parser or tagged corpora. In addition, the source of the probabilities, a machine-readable dictionary, is an extremely useful tool for all other levels of processing.

⁴ The American Heritage Dictionary of the English Language, Third Edition, copyright 1992 by Houghton Mifflin Company.

References

1. Allen, J. 1994. Natural Language Understanding, 2nd Edition, ch. 7. New York: Benjamin/Cummings.
2. Church, K.W. 1988. "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text." In Proceedings of the Second Conference on Applied Natural Language Processing, 136-143. Association for Computational Linguistics.
3. Church, K.W. 1992.
"Current Practice in Part of Speech Tagging and Suggestions for the Future." In For Henry Kučera, eds. A.W. Mackie, T.K. McAuley and C. Simmons, 13-48. Michigan Slavic Publications, University of Michigan.
4. DeRose, S.J. 1988. "Grammatical Category Disambiguation by Statistical Optimization." Computational Linguistics, 14(1): 31-39.
5. DeRose, S.J. 1992. "Probability and Grammatical Category: Collocational Analyses of English and Greek." In For Henry Kučera, eds. A.W. Mackie, T.K. McAuley and C. Simmons, 125-152. Michigan Slavic Publications, University of Michigan.
6. Dolan, W.B., L. Vanderwende, and S.D. Richardson. 1993. "Automatically deriving structured knowledge bases from on-line dictionaries." In Proceedings of the First Conference of the Pacific Association for Computational Linguistics, Simon Fraser University, Vancouver, BC, pp. 5-14.
7. Garside, R. and F. Leech. 1985. "A Probabilistic Parser." In Proceedings of the Second Conference of the European Chapter of the Association for Computational Linguistics, 166-170. Association for Computational Linguistics.
8. Greene, B.B. and G.M. Rubin. 1971. Automated Grammatical Tagging of English. Providence, R.I.: Brown University, Department of Linguistics.
9. Jensen, K. 1993. "PEG: the PLNLP English grammar." In Natural Language Processing: the PLNLP Approach, eds. K. Jensen, G. Heidorn, and S. Richardson, 29-45. Boston: Kluwer Academic Publishers.
10. Johansson, S., G.N. Leech, and H. Goodluck. 1978. Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers. Department of English, University of Oslo.
11. Kilgarriff, A. 1993. "Dictionary Word Sense Distinctions: An Enquiry Into Their Nature." Computers and the Humanities, 26: 365-387.
12. Kučera, H., and W.N. Francis. 1967. Computational Analysis of Present-Day American English. Providence, R.I.: Brown University Press.
13. Kupiec, J., and J. Maxwell. 1992.
"Training stochastic grammars from unlabelled text corpora." In AAAI-92 Workshop Program on Statistically-Based NLP Techniques, San Jose, CA: 14-19.
14. Richardson, S.D. 1994. "Bootstrapping Statistical Processing into a Rule-based Natural Language Parser." In Proceedings of the ACL Workshop "Combining Symbolic and Statistical Approaches to Language", pp. 96-103.
15. Vanderwende, L. 1995. "Ambiguity in the acquisition of lexical information." In Proceedings of the AAAI 1995 Spring Symposium Series, working notes of the symposium on representation and acquisition of lexical knowledge, 174-179.
16. Zipf, G.K. 1949. Human Behavior and the Principle of Least Effort. Cambridge: Addison-Wesley Press, Inc.

7 Acknowledgements

I would like to thank the other members of the Microsoft NLP group for their encouragement and insight: Mike Barnett, Bill Dolan, George Heidorn, Karen Jensen, Joseph Pentheroudakis, Steven Richardson and Lucy Vanderwende.