
Deriving Part of Speech Probabilities from a
Machine-Readable Dictionary
Deborah A. Coughlin
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
[email protected]
Abstract. A method to add part-of-speech statistics, derived from a dictionary, to a
rule-based broad-coverage parser is described. By supplying the parser with the most likely
part of speech, determined by the number of senses per part of speech in a standard
dictionary, parsing accuracy improves. This method improves the parser's accuracy without
requiring a large corpus or manual encoding. Part-of-speech statistics derived this way
may prove useful for languages that do not have large well-balanced corpora available.
1 Introduction
A Natural Language Processing system, to achieve broad coverage, needs a complete lexicon.
This lexicon must include frequent and seldom-used terms. Even rare parts of speech need to
be represented when attempting broad coverage.
Manually coding the required information is both time-consuming and error-prone. A standard on-line dictionary, though, represents centuries of hand-coding by skilled lexicographers.
It is meant for broad coverage in the sense that it is intended as a lexical tool for human users
in language tasks, regardless of the topic or domain. Machine-readable dictionaries (MRDs) are
expanding their user base to include NLP systems because they provide a large and complete
lexicon needed for broad coverage.
Though dictionaries prove useful as sources of comprehensive lexicons for natural language
parsers, their completeness introduces ambiguity that is not easily resolved. The importance
of handling part-of-speech ambiguity cannot be overstated. The American Heritage Dictionary
(1992 edition) has approximately 18,500 words with multiple parts of speech, which represents
around 12% of the total number of entries. (Inflected forms are included.) These part-of-speech
ambiguous words are common words found in many genres. DeRose [5] found that though
only 11% of word types in the Brown Corpus were part-of-speech ambiguous, that number
represented 48% of the actual tokens, evidencing that words which are part-of-speech ambiguous
tend to be common well-used words. Table 1 provides an example of a sentence that contains
many part-of-speech ambiguous tokens.
As Church [3] points out, relying solely on an augmented transition network (ATN) to
determine the most probable part of speech leads to multiple and unlikely parses. This is most
likely true for all broad-coverage rule-based approaches. To accomplish broad coverage, a parser
must be able to analyze the variety of structures found in real text. When there are multiple
words which are ambiguous with respect to their part of speech in a single sentence, determining
the most probable parse becomes a difficult undertaking. This problem becomes extreme when
truly broad-coverage parsing is attempted.
The brief notes introducing each work offer salient historical or technical points
(each token annotated with its candidate parts of speech: Adj, Adv, Conj, Noun, Pron, Verb)

Table 1. An Example from the Brown Corpus of a Sentence with Many Part-of-Speech Ambiguous
Tokens
It is computationally imperative that the parser be able to choose the most probable parse
from the potentially large number of possible parses. Further processing of the input quickly
becomes complex and inefficient if more than one parse must be considered [9]. Adding part-of-speech probabilities to the parsing process would guide the parser to produce the most likely
of the possible parses.
Over the last twenty-five years, statistical models for part-of-speech tagging of text have
made great strides in terms of simplicity of the algorithms used and accuracy of the output
[4]. Greene and Rubin [8] needed a large rule database for tagging the Brown Corpus. Garside
and Leech [7], DeRose [4], Church [2], and others have improved the efficiency and accuracy of
tagging algorithms while reducing the rule database. Nevertheless, the need for large training
corpora remains. Statistical approaches usually require a training corpus that has been manually
tagged with part-of-speech information.
The Microsoft NLP parser is an empirically-based, broad-coverage, rule-based chart parser.
It makes use of augmented phrase structure grammar rules, described in Jensen [9], to provide
a syntactic parse that can be exploited later to determine functional roles and long-distance
dependencies. Resolving part-of-speech ambiguities is a difficult task for our parser as it is for
other broad-coverage parsers. Statistical information derived from large corpora has proved
useful in this regard. Richardson [14] used our rule-based parser to derive part-of-speech and
rule probabilities from untagged corpora. He then incorporated those part-of-speech and rule
probabilities into our parser, improving its speed and accuracy. This approach assumes the
availability of a large corpus and a fairly comprehensive parser.
Large well-balanced corpora like the Brown Corpus [12] and the Lancaster-Oslo/Bergen
(LOB) Corpus [10] are available for relatively few languages. Other sources of part-of-speech
probabilities would prove useful for languages without these valuable resources. The remainder
of this paper explores the use of MRDs as a potential source of part-of-speech probabilities.
2 Dictionary as a Source of Part-of-Speech Probabilities
Zipf [16] found that on average, the more frequently a word occurs, the more senses it has (in
the dictionary). Extending Zipf's law a bit further, we postulate that for polysemous words, the
more frequent a part of speech, the more senses that part of speech will have. The word follow,
for example, has twenty-one verb senses, five phrasal verb senses and two noun senses in the
American Heritage Dictionary (1992 edition). Though we cannot claim that follow is thirteen
times more likely to be a verb, the greater number of verb senses coincides with our intuitions
about the more likely part of speech.
The method is conceptually simple. We begin with a machine-readable version of the American Heritage Dictionary (AHD). The AHD classifies words into eight parts of speech: noun,
verb, adjective, adverb, pronoun, preposition, conjunction and interjection. For each dictionary
entry, we tabulate the number of senses for each part of speech. For instance, the word school
has fourteen noun senses, three verb senses and one adjective sense in the AHD. A part-of-speech probability is then derived based on the number of senses counted. The part of speech
with the highest sense count is considered the most likely part of speech. Thus, for school, the
noun part of speech would be considered most probable. Though it is incorrect to consider these
calculations as the probability of a word to occur in a particular part of speech in real text, we
are claiming that for most entries, the larger the ratio of the number of senses for a part of
speech to the total number of senses, the more probable that part of speech.
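The core computation can be sketched in a few lines (a minimal illustration, not the actual Microsoft NLP code; the sense counts for school are the ones cited above):

```python
def pos_ratios(sense_counts):
    """Convert per-part-of-speech sense counts into ratios that
    rank the parts of speech by likelihood."""
    total = sum(sense_counts.values())
    return {pos: n / total for pos, n in sense_counts.items()}

def most_probable_pos(sense_counts):
    """The part of speech with the highest sense count is taken
    as the most likely one."""
    return max(sense_counts, key=sense_counts.get)

# AHD sense counts for "school": 14 noun, 3 verb, 1 adjective.
school = {"noun": 14, "verb": 3, "adj": 1}
print(most_probable_pos(school))             # noun
print(round(pos_ratios(school)["noun"], 2))  # 0.78
```

As the text cautions, the 0.78 is a ranking device rather than a true corpus probability.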
Part-of-speech probabilities are also determined for inflected forms. Inflected forms for dictionary entries are determined automatically by rule-based generation of inflectional paradigms
and information provided by the AHD. Inflected forms are treated as lexicalized entries for
the initial computation. For the word cats, for example, a noun and verb record are generated.
Because cats can be both a plural noun and a present tense, third-person, singular verb, the
sense counts from both the noun and verb senses of cat are assigned to cats. For catting, only
the verb senses of cat would be considered. There are twelve noun senses and two verb senses
for cats. For this word, the Brown Corpus has nine occurrences of cats as a noun and none as
a verb. Statistics derived from the Brown Corpus would favor the noun reading even more than
this approach but the relative ranking of the particular parts of speech remains the same. The
distance between the noun and verb calculations can be used to further penalize the unlikely
verb part of speech.
If an inflected form is already a dictionary entry (i.e., if it has been assigned senses in the
dictionary), the sense counts of the inflected form are combined with the counts for the base
form. For fell, there are two verb senses, six noun senses and four adjective senses. We add the
verb sense counts from fall because fell is also the past tense of fall. Not until the twenty-four
verb senses found under fall are added into the calculations do the numbers look reasonable.
In the Brown Corpus, fell is a verb 100% of the time. If we use the ratios derived from this
method to indicate likelihood of a particular part of speech, fell would have an 89% probability
of being a verb. Though this approach does not "equal" the part-of-speech calculations that a
statistical approach would generate, it does provide similar results.
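The combination of an inflected form's own counts with its base form's counts might look like this (a sketch that assumes the valid parts of speech of the inflection are already known; the fell/fall counts are the ones cited above, with only the shared verb senses carried over):

```python
def combined_counts(entry_counts, base_counts, shared_pos):
    """Combine an inflected form's own sense counts with the counts
    of its base form, but only for the parts of speech in which the
    inflection is valid (e.g. 'fell' as past tense of 'fall' shares
    only the verb senses)."""
    combined = dict(entry_counts)
    for pos in shared_pos:
        combined[pos] = combined.get(pos, 0) + base_counts.get(pos, 0)
    return combined

# 'fell' has its own senses (2 verb, 6 noun, 4 adjective) and, as the
# past tense of 'fall', inherits fall's 24 verb senses.
fell = combined_counts({"verb": 2, "noun": 6, "adj": 4},
                       {"verb": 24},  # verb senses of 'fall'
                       shared_pos={"verb"})
print(fell)  # {'verb': 26, 'noun': 6, 'adj': 4}
```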
Figure 1 describes the approach to tabulating part-of-speech sense counts. Once the sense
counts are tabulated, they are given to the parser, in hopes that this information will improve
the parse quality and reduce the size of the search space required to get a good parse.
1. Generate inflected forms for each dictionary entry.
2. For each entry (inflected forms included) in the MRD,
   - Calculate the number of senses for each part of speech.
   - If the entry is an inflected form, calculate the number of senses for the lexeme of the part of speech
     of the inflected form. Add the entry and lexeme counts.

Figure 1. Algorithm for Determining Part-of-Speech Sense Counts.
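A toy end-to-end rendering of the two steps in Figure 1 (the miniature dictionary and the -s inflection generator are illustrative stand-ins for the full AHD and the rule-based paradigm generation):

```python
# Toy MRD: entry -> {part of speech: number of senses}.
MRD = {"cat": {"noun": 12, "verb": 2},
       "school": {"noun": 14, "verb": 3, "adj": 1}}

def inflections(entry):
    """Illustrative stand-in for rule-based paradigm generation."""
    yield entry + "s", "noun"   # plural noun
    yield entry + "s", "verb"   # third-person singular present verb

def sense_counts(mrd):
    # Step 2: start from each base entry's own sense counts.
    counts = {entry: dict(c) for entry, c in mrd.items()}
    for entry, senses in mrd.items():
        # Step 1: generate inflected forms; each inherits the lexeme's
        # counts for the matching part of speech.  If a form is itself
        # a dictionary entry (like 'fell'), the counts are added to its own.
        for form, pos in inflections(entry):
            if pos in senses:
                counts.setdefault(form, {})
                counts[form][pos] = counts[form].get(pos, 0) + senses[pos]
    return counts

table = sense_counts(MRD)
print(table["cats"])  # {'noun': 12, 'verb': 2}
```

This reproduces the cats example from the text: the plural inherits both the noun and the verb sense counts of cat.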
3 The System Using Part-of-Speech Probabilities
Our project makes use of a rule-based chart parser. In an effort to avoid the inefficient all-paths, breadth-first approach to parsing, we employ a strategy described in Richardson [14].
This approach is similar to the best-first parsing algorithm Allen [1] describes. The Richardson
algorithm used in our system makes use of a priority queue. All part-of-speech records 1 for
all words in an input string are placed in the queue. For each word in the input string, the
part-of-speech record considered most probable is placed at the top of the queue. This assures
that for each word in the input, the most probable part-of-speech record is made available to
the parser initially. The rest of the queue contains both rule and part-of-speech records sorted
highest probability first.
For each word in an input string, the value of the part-of-speech calculation for the part of
speech considered most probable is not relevant. That part of speech is given a probability of 1.0
so that it is placed at the top of the queue. For the other part-of-speech records for each entry,
the value of the part-of-speech calculation determines how high it is placed in the queue. Rules
are attempted and part-of-speech records are entered into the chart in the order they are found
in the queue. Each time a part-of-speech record enters the chart, all applicable rules (based
on examination of constituent sequences) are also placed in the queue, positioned according to
their associated probabilities. When a parse tree for the entire input string is found, the process
ends.
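The queuing discipline can be sketched with a standard min-heap (a simplified illustration: the record structures are hypothetical, the ratios for follows are illustrative, and the real system interleaves rule records in the same queue):

```python
import heapq

def initial_queue(words, pos_probs):
    """Build the initial priority queue of part-of-speech records.
    For each word, the most probable part of speech is forced to the
    top by assigning it probability 1.0; the remaining records keep
    their dictionary-derived values."""
    queue = []
    for i, word in enumerate(words):
        ranked = sorted(pos_probs[word].items(), key=lambda kv: -kv[1])
        for rank, (pos, p) in enumerate(ranked):
            prob = 1.0 if rank == 0 else p
            # heapq is a min-heap, so push negated probabilities.
            heapq.heappush(queue, (-prob, i, pos))
    return queue

probs = {"school": {"noun": 14/18, "verb": 3/18, "adj": 1/18},
         "follows": {"verb": 21/23, "noun": 2/23}}  # illustrative ratios
q = initial_queue(["school", "follows"], probs)
# The two probability-1.0 records (one per word) come off the heap first:
print(heapq.heappop(q))  # (-1.0, 0, 'noun')
print(heapq.heappop(q))  # (-1.0, 1, 'verb')
```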
Though the set of augmented phrase structure grammar rules that the chart parser relies on
is optimized to produce just one parse for each sentence, multiple parses and fitted 2 parses are
also possible. We used the sense ratio information from the dictionary to reduce the search space
and produce the most likely parse first. This leads to a reduction in the number of multiple and
fitted parses (for grammatical sentences) and makes the process less computationally expensive.
Richardson [14] provides a complete description of this queuing process. He also describes
the process that generated the rule probabilities and normalized the part-of-speech to the rule
probabilities. The only difference between his system and the one being described here is the
source of the part-of-speech data. For his system, the Brown Corpus was used to compute
part-of-speech probabilities; in our approach, a dictionary was used.
4 Results
The AHD in machine-readable form was scanned using the algorithm described above to obtain
the sense counts for each entry. The process took less than thirty minutes on a Pentium/100 PC.
The resulting values were made available to the parser by a probabilistic algorithm described
in Richardson [14].
The parser was applied to three sets of test sentences to determine whether this approach was
helpful in improving parsing accuracy and efficiency. The tests were run using no probabilities
and using the part-of-speech probabilities derived from the MRD.
The first test was a list of 2800 linguistic textbook example sentences. This ensured that
there was a wide variety of linguistically complex structures in the test file. The average sentence
length was 10 words. Of the 179 sentences that parsed differently when dictionary-derived
probabilities were added, 64% of the parses generated using these probabilities were better,
while 23% represented regressions. Fitted parses were reduced by 45% using these part-of-speech
probabilities. There were 114 fitted parses for the 2800 sentences when no probabilities
were used. That number was reduced to 63 using these part-of-speech probabilities. These test
results are in Table 2.

1 Part-of-speech records are similar to edges in traditional chart parsing. These part-of-speech records
represent the union of the syntactic and morphological information contained in the senses for that
part of speech. There is only one record per part of speech for a given word.
2 Fitted parses are assigned to strings which cannot be analyzed as fully well-formed sentences according
to the grammar embodied in the parser.
Type of Parse Change When Probabilities Added
  Better Parse         64%
  Worse Parse          23%
  Other 3              13%
  Fewer Fitted Parses  45%

Table 2. Test file 1: Nature of Changes in Parsing When Dictionary-Derived Part-of-Speech Probabilities Were Added to Parser.
The same test was run using the Brown Corpus derived part-of-speech probabilities generated by Richardson [14]. The parser did better using Richardson's probabilities than it did
using dictionary-derived probabilities. There were 170 different parses when corpus-based probabilities were added. Of those 170 differences, 74% represented improvements. The number of
fitted parses decreased by 47%.
It is not surprising that the results using Richardson's corpus-based probabilities were better
than those using dictionary-based probabilities. The parser has been using the part-of-speech
probabilities generated by Richardson for over two years. Those two years of optimizing the
system with those part-of-speech probabilities give it an advantage. Richardson's probabilities
include rule probabilities. These same rule probabilities were used for both approaches.
The second set of test sentences consisted of 523 sentences randomly selected from the Brown
Corpus. The average length of these sentences was 21.5 words. Because of the random nature
of the selection method, all types of sentences, including ones containing mathematical formulas,
were selected. No hand editing was done. In this test, 117 differences were found when comparing
the parses using no probabilities to parses generated using the dictionary-based probabilities. Of
those 117, 37% were improvements, while 14% were regressions. The remaining 49% represented
parse differences that could not be considered qualitatively better or worse.
The same test was applied to the parser using Richardson's part-of-speech probabilities.
In this case the results are similar to the dictionary-based results. Of the 121 differences, 35%
represented improvements, while 15% represented regressions. Again, 49% of the differences were
found to be neither improvements nor regressions. Both the corpus-based and the dictionary-based results are presented in Table 3.
              Dictionary-Based Probabilities  Corpus-Based Probabilities
Better Parse  37%                             35%
Worse Parse   14%                             15%
Other         49%                             49%

Table 3. Test file 2: When Compared to No Probabilities, Percentage of Change that Represented
Improvements, Regressions and No Qualitative Difference for the Two Approaches.
The third test consisted of a selection of both linguistic example sentences and text from a
variety of real-world sources: The Wall Street Journal, Time Magazine, etc. The average length
of these sentences was 17 words. We used this test to focus on the number of multiple parses
generated. We assume that a reduction in the number of multiple parses indicates an improvement
in the parser's efficiency and accuracy. The quality of these parses was not examined,
but the first two test results give us reason to believe that this is true. The parser was set up
to allow multiple parses for this test. For the 600-sentence file, the number of sentences that
received a non-fitted parse was compared to the total number of parses generated; these results
are summarized in Table 4.

3 The category Other indicates those parses were not qualitatively different.
                                No. of Sentences  Total Number  Average Number
                                with a Parse      of Parses     of Parses
No Probabilities                529               1059          2.0
Dictionary-Based Probabilities  536               971           1.8
Corpus-Based Probabilities      546               925           1.7

Table 4. Test file 3: Average Number of Parses per Sentence Successfully Parsed for a 600 Sentence
Test File.
The dictionary-based approach reduced the total number of multiple parses by 88 parses
for this 600 sentence test le. When corpus-based probabilities were used with the parser, the
results were even better; 134 fewer multiple parses were generated.
The dictionary-based approach to deriving part-of-speech probabilities needs further refinement, but these results suggest there is merit in further study. The interaction between the
parser, rule probabilities and the part-of-speech probabilities needs to be explored to further
reduce the search space that the parser must traverse to produce a good parse.
5 Discussion
The use of a dictionary as the sole source of part-of-speech information has been criticized in
the literature. Church [2] makes the point that dictionaries are designed to inform the user of
the possible, without indicating what is most probable. Both the primarily unambiguous see
and the truly ambiguous saw have noun and verb senses [2]. Without relying on some sort of
probabilities, both the noun part of speech and the verb part of speech would receive equal
consideration. This approach shows that the dictionary does have information to indicate the
likelihood of a part of speech. For saw, 28% of the senses are noun senses but for see, only 4% of
the senses are noun senses. Using the dictionary as the sole source of part-of-speech information
does not mean frequent and rare parts of speech must be considered equally.
This approach assumes availability of an MRD. For some languages, comprehensive dictionaries are not available. Statistical approaches have a similar problem. They require a large,
balanced corpus and under most approaches, a tagged training corpus. Having a viable source
beyond large corpora for determining part-of-speech probabilities increases the chances that
part-of-speech information can be obtained for a particular language.
In using a standard, all-purpose dictionary, our system is optimized for general-purpose
text, but for some tasks, being able to tailor a parser to a specific domain may be helpful. The
dictionary has a wealth of domain-specific information hidden in its senses. The calculations
used here can be biased to favor parts of speech that occur in a specific domain. As noted earlier,
cat has twelve noun senses and two verb senses, thus the verb part of speech would normally
be placed low in the queue. If tailoring this system to a nautical domain, the verb sense of
cat, to hoist an anchor to (the cathead), 4 could trigger the system to assign a higher count to
that part of speech, causing the verb sense to be favored. Another possibility might be to use
domain-specific dictionaries in combination with a comprehensive dictionary to determine the
sense ratios.
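One way such a bias could be realized (a hypothetical sketch; it assumes each sense carries an optional domain label, which is not how the AHD data used here is actually structured):

```python
def biased_counts(senses, domain, boost=10):
    """Tabulate per-part-of-speech sense counts, weighting senses
    labeled with the target domain more heavily so that otherwise
    rare parts of speech can surface in domain-specific text."""
    counts = {}
    for pos, sense_domain in senses:
        weight = boost if sense_domain == domain else 1
        counts[pos] = counts.get(pos, 0) + weight
    return counts

# Toy sense list for 'cat': twelve noun senses, two verb senses,
# one verb sense carrying a (hypothetical) nautical label.
cat = [("noun", None)] * 12 + [("verb", None), ("verb", "nautical")]
print(biased_counts(cat, domain="nautical"))  # {'noun': 12, 'verb': 11}
```

With the boost applied, the verb record for cat would be placed much higher in the queue than its raw two senses warrant.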
The sparse data problem encountered in statistical approaches persists to some degree in
this approach. The greater the number of senses for a particular entry, the closer these results
align with the part-of-speech probabilities generated by Richardson [14] and actual occurrence
in the Brown Corpus. When the number of senses is low, the results are less reliable. To some
extent, the division of word meanings into distinct dictionary senses depends on arbitrary
lexicographic decisions [11]. For verbs, for example, whether a meaning that can be both
transitive and intransitive is represented as one sense or two depends on the lexicographic
traditions of a particular dictionary. Longman's Dictionary of Contemporary English (1978 edition)
uses one sense; the AHD uses two. The possibility that the existence of multiple transitivity states
for a verb should influence frequency predictions remains unexplored.
One possible objection to this approach is the time and expense of acquiring an MRD.
However, the MRD is an essential tool in further semantic work. To go beyond the useful but
limited information provided by a syntactic parse, a rich knowledge base is necessary. Dolan, et
al. [6] and Vanderwende [15] have made use of automated strategies to exploit the rich source of
lexical information found in on-line dictionaries to create a highly-structured lexical knowledge
base. Because the MRD is a necessary development tool, the cost of generating these part-of-speech probabilities is negligible. Well-balanced corpora take time and resources to develop.
The approach described in this paper can provide a set of part-of-speech probabilities while
a parser is in the development stage, before the parser is mature enough to parse real text.
Currently, it has only been tested using a mature English parser. We hope to test this approach
on versions of the same system for other languages that are in earlier stages of development.
From those tests, we will find out whether this approach is useful in the development of a
rule-based parser. Richardson's [14] approach for supplying part-of-speech probabilities to the
parser requires a fairly mature parser to get well-balanced part-of-speech probabilities.
6 Conclusion
We have described a unique approach to determining part-of-speech probabilities to use with a
rule-based broad-coverage parser. The results show that by calculating the number of senses per
part of speech in a comprehensive MRD to determine the most probable part of speech and then
supplying that information to the rule-based parser, parses improve and fewer multiple parses
are generated. A notable feature of this approach is that it does not require a mature parser or
tagged corpora. In addition, the source of the probabilities, a machine-readable dictionary, is
an extremely useful tool for all other levels of processing.
References
1. Allen, J. 1994. Natural Language Understanding, 2nd Edition, ch. 7. New York: Benjamin/Cummings.
4 The American Heritage Dictionary of the English Language, Third Edition, copyright 1992 by
Houghton Mifflin Company.
2. Church, K.W. 1988. "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text."
In Proceedings of the Second Conference on Applied Natural Language Processing, 136-143. Association for Computational Linguistics.
3. Church, K.W. 1992. "Current Practice in Part of Speech Tagging and Suggestions for the Future."
In For Henry Kučera, eds. A.W. Mackie, T.K. McAuley and C. Simmons, 13-48. Michigan Slavic
Publications, University of Michigan.
4. DeRose, S.J. 1988. "Grammatical Category Disambiguation by Statistical Optimization." Computational Linguistics, 14(1): 31-39.
5. DeRose, S.J. 1992. "Probability and Grammatical Category: Collocational Analyses of English
and Greek." In For Henry Kučera, eds. A.W. Mackie, T.K. McAuley and C. Simmons, 125-152.
Michigan Slavic Publications, University of Michigan.
6. Dolan, W.B., L. Vanderwende, and S.D. Richardson. 1993. "Automatically deriving structured
knowledge bases from on-line dictionaries." In Proceedings of the First Conference of the Pacific
Association for Computational Linguistics, at Simon Fraser University, Vancouver, BC, pp. 5-14.
7. Garside, R. and F. Leech. 1985. "A Probabilistic Parser." In Proceedings of the Second Conference
of the European Chapter of the Association for Computational Linguistics, 166-170. Association for
Computational Linguistics.
8. Greene, B.B. and G.M. Rubin. 1971. Automated Grammatical Tagging of English. Providence, R.I.:
Brown University, Department of Linguistics.
9. Jensen, K. 1993. "PEG: the PLNLP English grammar." In Natural Language Processing: the
PLNLP Approach, eds. K. Jensen, G. Heidorn, and S. Richardson, 29-45. Boston: Kluwer Academic
Publishers.
10. Johansson, S., G.N. Leech, and H. Goodluck. 1978. Manual of Information to Accompany the
Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers. Department
of English, University of Oslo.
11. Kilgarriff, A. 1993. "Dictionary Word Sense Distinctions: An Enquiry Into Their Nature." Computers and the Humanities, 26: 365-387.
12. Kučera, H., and W.N. Francis. 1967. Computational Analysis of Present-day American English.
Providence, R.I.: Brown University Press.
13. Kupiec, J., and J. Maxwell. 1992. "Training stochastic grammars from unlabelled text corpora."
In AAAI-92 Workshop Program on Statistically-Based NLP Techniques. San Jose, CA: 14-19.
14. Richardson, S.D. 1994. "Bootstrapping Statistical Processing into a Rule-based Natural Language
Parser." In Proceedings of the ACL Workshop "Combining symbolic and statistical approaches to
language", pp. 96-103.
15. Vanderwende, L. 1995. "Ambiguity in the acquisition of lexical information." In Proceedings of
the AAAI 1995 Spring Symposium Series, working notes of the symposium on representation and
acquisition of lexical knowledge, 174-179.
16. Zipf, G.K. 1949. Human Behavior and the Principle of Least Effort. Cambridge: Addison-Wesley
Press, Inc.
7 Acknowledgements
I would like to thank the other members of the Microsoft NLP group for their encouragement
and insight: Mike Barnett, Bill Dolan, George Heidorn, Karen Jensen, Joseph Pentheroudakis,
Steven Richardson and Lucy Vanderwende.