Using a Hybrid System of Corpus- and Knowledge-Based Techniques
to Automate the Induction of a Lexical Sublanguage Grammar
Geert Jan Wilms
Union University
2447 Hwy 45 Bypass, Box 1857, Jackson, TN 38305, USA
jwilms@buster.uu.edu
Abstract
Porting a Natural Language Processing (NLP) system to a new domain remains one of the bottlenecks in syntactic parsing, because of the amount of effort required to fix gaps in the lexicon and to attune the existing grammar to the idiosyncrasies of the new sublanguage. This paper shows how the process of fitting a lexicalized grammar to a domain can be automated to a great extent by using a hybrid system that combines traditional knowledge-based techniques with a corpus-based approach.
1. Porting Bottleneck
The traditional grammar knowledgebase is the product of a never-ending attempt by linguists to impose order on something that refuses to be pinned down because it is a living thing. To a great extent, of course, these linguists are able to point to regularities, because language is first of all a practical thing, a means to communicate, and there must be a common base for such transfer to take place. But all rules have exceptions, and often it turns out these exceptions are not isolated or random, so the rule is fine-tuned. The problem is that what is "grammatical" depends on the unwritten rules of a certain domain. When the core grammar is augmented to accommodate all these idiosyncrasies, the danger is not that an ungrammatical sentence might slip through, but that perfectly legitimate input receives an incorrect analysis that is sanctioned by some peripheral grammar rule that does not apply to the domain under investigation. The semantic component which receives this false positive may reject it and request a second reading, and the correct parse will most probably come down the pipeline eventually if the grammar is truly broad-coverage, but a semantic module is not always well equipped to detect such errors and may have a difficult enough time trying to resolve attachment problems, anaphoric references, etc., even when presented with the "right" parse.
In systems that use a lexical grammar, i.e., where part of the grammatical "knowledge" is stored outside the non-terminals of the grammar proper, using subcategorization frames associated with terminals (words in the lexicon), the peril likewise is that this resource becomes bloated over time with options exercised only in certain settings or when the word is used in a marginal sense.
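A minimal sketch of what such a lexicalized entry might look like; the representation and the feature names are illustrative assumptions, not PUNDIT's actual format:

```python
# A minimal sketch of a lexicalized entry; the representation and the
# feature names are illustrative assumptions, not PUNDIT's actual format.
from dataclasses import dataclass, field

@dataclass
class VerbEntry:
    lemma: str
    features: set = field(default_factory=set)     # subcategorization options

def licenses(entry: VerbEntry, option: str) -> bool:
    """The parser consults the terminal's entry before trying a rule."""
    return option in entry.features

report = VerbEntry("report", {"TRANSITIVE", "THAT-COMP", "TO-INFINITIVE"})
assert licenses(report, "THAT-COMP")
```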
Clearly something must be done to separate
the wheat from the chaff; the problem is twofold: getting the grammar and lexicon to a certain level of competence was a laborious and time-consuming process, and undoing this (i.e., eliminating unwanted options) is almost as difficult and painful as the constant augmenting in the first place. And secondly, what constitutes wheat and chaff is different for each domain, so this "dieting" must be repeated for every port.
Corpus-based techniques can help automate this filtering, i.e., the source text should be viewed not only as an "obstacle" to be tamed (parsed), but as a resource that is the best authority on what is grammatical for the domain.
2. Data-Driven Attuning
Since the early 90s, there has been a surge of interest in corpus-based NLP research; some researchers have tackled the grammar proper, making it a probabilistic system, or doing away with a rule-based system altogether and inducing a customized grammar from scratch using stochastic methods. Despite the shortcomings of knowledge-based systems, it seems wrong to throw away all that has been gained, imperfect as it is. Rather, a hybrid system should be developed where the strengths of both paradigms are combined. A good example of that is a probabilistic Context-Free Grammar.
Both Brent (1993) and Manning (1993), who attempt to induce a lexicon of subcategorization features, do so by completely discarding all preexisting knowledge; both systems are stand-alone, without a parsing engine to test or use the "learned" information. Brent in fact takes the "from scratch" approach to an extreme, and models his system after the way a child learns to understand language. The algorithm of both authors basically involves a pattern matcher that scans the input for a verb; once an anchor is found, its right context is searched for cues for subcategorization frames. Brent's cues are very primitive, but because he only picks up frames when the indicators are unambiguous, his results are very reliable, albeit sparse (unless a very large training corpus is used). Manning's triggers, on the other hand, are more sophisticated, but because they are less dependable he must rely on heavy statistical filtering to reduce the "noise." Although Manning's work in inducing features certainly accomplishes the goal of customizing the lexicon to a particular domain, the porting process is still very much a manual enterprise in that he must write a mini-parser, a finite state machine that includes an NP recognizer "and various other rules to recognize certain cases that appear frequently" (1993, 237).
The dilemma of any pattern matching approach is in essence a bootstrapping problem: if the goal is to induce syntactic information (in the form of lexical features), then paradoxically some heavy syntactic processing power is needed to "parse" the training data to mine for evidence that a particular verb subcategorizes for an object option, while avoiding false triggers (imposter patterns). Manning has built into his finite state device a panic mode to skip over ambiguous elements, but the trick is to recognize when things get hairy; that is where a lot of programming effort takes place, and this fine-tuning is never over (and must be repeated for every port to a new domain), as Manning himself admits (1993, 238).
3. Category Space of Context Digests
The category space described in this paper uses a very different approach to induce subcategorization frames; instead of starting from scratch, the existing rich lexicon is exploited and features are assigned to new words based on their paradigmatic relatedness to known words. Thus, instead of having to "hunt" for evidence, this approach is able to exploit the expertise of the seasoned linguists who constructed the initial lexicon, which was intentionally designed to be broad-coverage. Such a strategy not only avoids having to distinguish good cues from irrelevant triggers, but is capable of inducing some features, like ASSERTION, for which there is no marker that would indicate their presence.
A category space is a multi-dimensional space in which the syntactic category of words is represented by a vector of co-occurrence counts (Schütze 1993). Proximity between two such vectors, or context digests, can be used to measure the paradigmatic relatedness of the words they represent (Schütze and Pedersen 1993). Paradigmatic relatedness indicates how well two words can be substituted for each other, i.e., how similar their syntactic behavior is. This is not the same as the synonym relationship, which is based on semantic similarity.
There are two general approaches in the
literature to collecting distributional information:
window-based and syntactically-based (Charniak
1993). In the latter scheme the text is scanned until a
section is found that is deemed to be relevant. The
"rough" structure of the sentence is computed, a
process known as partial parsing. This produces a flat
tree with phrase boundaries marked and identified by
type, but without much internal detail.
A second approach to collecting relevant
distributional information is to keep co-occurrence
counts of the nearest lexical neighbors of a word,
usually within a fixed distance or "window." Markov
models, for example, predict the POS of a word based
on the tags of the two or three words preceding it
(bigrams and trigrams respectively). Schütze has experimented with window lengths of four words, two hundred letter fourgrams, and two thousand characters (Schütze 1993).
In the research presented here, a window of four was adopted, i.e., for words of interest in the domain of physical chemistry, co-occurrence counts were kept between those words and their immediate left neighbors (w_{i-1} w_i), immediate right neighbors (w_i w_{i+1}), and left and right neighbors that are two words away (w_{i-2} w_i and w_i w_{i+2} respectively).
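The following sketch shows one way to collect these four positional windows from a POS-tagged corpus; the function and variable names are assumptions made for illustration, not the code used in this research:

```python
# Sketch of collecting the four positional co-occurrence windows from a
# POS-tagged corpus; names are illustrative assumptions.
from collections import Counter, defaultdict

OFFSETS = (-2, -1, 1, 2)                 # the four fixed windows

def collect_counts(tokens, targets):
    """tokens: [(word, pos), ...] for the tagged corpus;
    targets: word types of interest (e.g., frequent verbs)."""
    counts = defaultdict(Counter)        # (target, offset) -> neighbor counts
    for i, (word, _pos) in enumerate(tokens):
        if word in targets:
            for off in OFFSETS:
                if 0 <= i + off < len(tokens):
                    counts[(word, off)][tokens[i + off]] += 1
    return counts
```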
One important difference between the category space reported here and the one in Schütze and Pedersen (1993) is that words were disambiguated by part of speech, so as not to mix up context information of unrelated tokens, a problem Schütze acknowledges plagues his system (1993, 254). The corpus was tagged using Brill's tagger (Brill 1993), which is based on what he calls transformation-based error-driven learning. 1430 word types tagged as verbs occurred frequently enough (>10 times) in the training corpus to warrant constructing a vector or context digest. As Zipf's law would predict, there is a long tail of word types which occur too infrequently to permit gathering useful statistics.
Each window of the context digests tracks co-occurrence counts with word types of any POS, provided these types have a minimum frequency of 100 in the training corpus. For "rare" neighbors, the algorithm simply records the neighbor's POS, a compromise to keep the size of the arrays manageable while providing some information on the syntactic context.
Context digests are formed by combining the 4 fixed windows, each consisting of co-occurrence counts with 5,509 possible neighbors. In addition, some limited long(er)-distance information is appended to the vector: the training corpus has been augmented with bracketing information, that is, with implicit trees that exhibit binary branching, but whose nonterminals are unlabelled. This is another application of Brill's transformation-based error-driven learner (Brill 1993), which was trained on 32,000 bracketed sentences from the Penn Treebank. These phrasal boundaries are of variable length, and can in fact span the whole sentence. Ideally, the name of the phrase type that the verb occurred in should be used as a clustering feature, but since this information is unavailable (the non-terminals in the trees implicit in the bracketing are unlabelled), the next best thing is used, and each boundary is marked by the pair of tags occurring on either side of the bracket.
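Putting the pieces together, a context digest might be assembled along the following lines; the concrete vector layout and helper names are assumptions (the paper specifies only the dimensions):

```python
# Hedged sketch of assembling one context digest: four positional
# windows over frequent neighbors, a POS back-off slot for rare ones,
# and appended bracket-boundary tag pairs. Layout is an assumption.
import numpy as np

def build_digest(windows, freq_index, pos_index, bracket_index, bracket_pairs):
    """windows: 4 Counters of (word, pos) neighbor counts, one per offset;
    *_index: dicts mapping items to vector slots; bracket_pairs: observed
    (left_tag, right_tag) pairs around brackets enclosing the verb."""
    block = len(freq_index) + len(pos_index)       # slots per window
    digest = np.zeros(4 * block + len(bracket_index))
    for w, neighbors in enumerate(windows):
        base = w * block
        for (word, pos), count in neighbors.items():
            if word in freq_index:                 # neighbor occurs >= 100x
                digest[base + freq_index[word]] += count
            else:                                  # rare: record POS only
                digest[base + len(freq_index) + pos_index[pos]] += count
    for pair in bracket_pairs:
        digest[4 * block + bracket_index[pair]] += 1
    return digest
```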
Each context digest for verbs, then, contains 27,654 possible entries. The resulting matrix is very sparse, however; the density for the verb category space is only 1.5 percent. Hence the distributional information is generalized by means of a matrix manipulation method called Singular Value Decomposition (SVD). This technique is often used in factor analysis, because reducing the representation to a low dimensionality allows one to better visualize the space. It is exactly this compactness of representation that has led Schütze to apply SVD to the field of NLP, to reduce the number of input parameters to a neural net without sacrificing too many of the fine distinctions in the original text (Schütze 1993). Deerwester et al. (1990) introduced SVD to the field of information retrieval for improved document representations; the original term-document matrix is decomposed into linearly independent factors, many of which are very small. An approximate model with fewer dimensions can be constructed by ignoring these small components. By combining only the first k linearly independent components, a reduced model is built which disregards lesser terminology variations, because k is smaller than the number of rows (terms).
To generalize the associational patterns in the category space that was bootstrapped from the physical chemistry corpus, SVD was applied with a conservative value for k of 350. The tool used for this purpose was a slightly modified version of the las2 module from the SVDPACKC package (Berry et al. 1993). The generalizing effect of SVD causes the category space for verbs to become much less sparse: 35.4 percent of the entries now have non-zero "counts." Most of these are new counts, i.e., SVD infers context similarities between words that may not be apparent in the original co-occurrence matrix due to the natural randomness in any corpus sample. The average number of context digests that are very similar (greater than 97 percent confidence) remains fairly constant after SVD, but the dimension reduction provides a lot more information about syntactic behavior when a less strict cutoff value is adopted (say 90 percent).
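As a rough illustration of the reduction step, SciPy's sparse SVD can stand in for SVDPACKC's las2 routine (an assumption for the sketch; the actual experiments used a modified las2):

```python
# Minimal sketch of the generalization step, using scipy.sparse.linalg.svds
# as a stand-in for the modified las2 module from SVDPACKC.
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def reduce_space(cooc, k=350):
    """cooc: verbs x contexts co-occurrence matrix (very sparse).
    Returns the rank-k representation of each verb's context digest."""
    U, s, Vt = svds(csr_matrix(cooc, dtype=float), k=k)
    # Rows of U * s are the reduced digests; multiplying back out as
    # U @ diag(s) @ Vt smooths zero cells into inferred "counts".
    return U * s
```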
4. Induction based on Neighborhoods
Proximity in this reduced space is then used to find for each context digest a neighborhood of words that are paradigmatically related. Proximity can be computed by using the cosine similarity measure, which was a major feature of the SMART information retrieval system (Salton 1983). This measures the cosine of the angle between two context digests, which can be viewed as vectors in an s-dimensional space.
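The measure itself is standard; a minimal implementation:

```python
# Cosine of the angle between two context-digest vectors.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0
```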
The category space can be clustered by
comparing pairs of context digests using the cosine
similarity measure; such clusters contain words whose
syntactic behavior is substantially similar. The degree of similarity depends on the adopted threshold value. However, these neighborhoods are not traditional clusters; each verb has its own individual representation in a multi-dimensional space, i.e., is the center of its own neighborhood. Typically any given verb is a vector which simultaneously belongs in several neighborhoods.
Verbal subcategorization frames like transitivity, or the ability to take a that-complement or to-infinitive, can be induced for new words based on a "composite" of features associated with "similar" verbs that are defined in the lexicon. The knowledgebase used in this research is the domain-independent lexicon of PUNDIT, a broad-coverage symbolic NLP system, which contains 164 verbs with detailed subcategorization information (Hirschman et al. 1989). PUNDIT's features are a subset of Sager's Linguistic String Project features (Sager 1981), which include selectional restrictions, features that license constructs, and object options that affect the interpretation of a sentence.
The induction works as follows: each verb has its own neighborhood, formed by computing the cosine similarity weight between it and all other verbs in the category space, and by retaining those whose weight exceeds a certain threshold. If there are no nearby verbs with known features, more remote words can be used for deciding whether a certain feature should apply to the verb being examined, especially if a substantial majority of these "distant relatives" are in agreement. If the features are treated as boolean values (present/not present), it will most certainly happen in neighborhoods with liberal cutoff points that there will be some disagreement for individual options, so a heuristic must negotiate these "conflicts" and settle for the best abstraction. Such a heuristic should have the following three characteristics (a sketch follows the list):
1) verbs that are close to the word being examined
should carry more weight in the decision process
than verbs that are closer to the perimeter.
2) both positive and negative evidence (the absence
of a feature for a particular verb) should be
considered.
3) given the fact that the presence of a feature is the result of a positive decision/action (by a linguist), whereas the absence may be an oversight, there should be a (slight) bias in favor of the former; the sensitivity threshold can be adjusted by shifting the point at which the weight of evidence is considered sufficient to decide in favor of adopting the feature.
The existing verbs in the lexicon themselves undergo a similar process whereby they are fitted to the domain: some of their "generic" features which are not appropriate are dropped, whereas "gaps" in object options are filled. The net result is that the grammar becomes attuned to the sublanguage: parses become possible because the enabling features are present, while the search space is pruned of many false positives because unnecessary features are omitted.
5. Evaluation
Manning evaluates his system by computing precision and recall scores with the OALD dictionary as gold standard. However, precision is not a good yardstick for evaluating the performance of the induction process, because it measures the outcome against a "flawed" lexicon; the induced features, because of the data-driven nature of the process, are more "precise" when measured against the "real world" of the sublanguage domain than the hand-built entries, which are mostly the product of introspection and anecdotal evidence. The system described in this paper was tested instead by comparing the number of successful parses of a held-out test corpus before and after customizing the lexicon. Out-of-the-box PUNDIT returned 42 parses for the 170 sentences in the test corpus (some of which were false positives), versus 94 successful parses using the attuned lexicon. It should be pointed out that these 94 sentences contain an average of 2.14 verbs.
6. Conclusion
The category space is the arbiter of
paradigmatic relatedness, and since it is bootstrapped
from a training corpus that is.representative for the
domain sublanguage, the resulting lexical entries will
be customized for that domain. Porting the lexicon to
a new domain is as simple as bootstrapping another
category space. Experiments with PUNDIT, a broadcoverage symbolic NLP system, have shown that the
category space can successfully bc used to induce
features like transitivity and subcategorization for
clauses and infinitival complements.
The advantage of combining data-driven
mining with the existing lexical knowledgebase over
other bootstrapping methods is that this approach
does not require the manual identification of
appropriate cues for subcategorization features, or the
involved construction of a pattern matcher that is
sophisticated enough to ignore false triggers.
7. References

Berry, Michael, Theresa Do, Gavin O'Brien, Vijay Krishna, and Sowmini Varadhan. 1993. SVDPACKC (version 1.0) user's guide. Technical Report CS-93-194. Knoxville, TN: Department of Computer Science, University of Tennessee.

Brent, Michael. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics: Special Report on Using Large Corpora: II 19 (June): 243-62.

Brill, Eric. 1993. A corpus-based approach to language learning. Ph.D. diss., University of Pennsylvania.

Charniak, Eugene. 1993. Statistical language learning. Cambridge, MA: MIT Press.

Deerwester, Scott, Susan Dumais, George Furnas, Thomas Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (September): 391-407.

Hirschman, Lynette, Martha Palmer, John Dowding, Deborah Dahl, Marcia Linebarger, Rebecca Passonneau, Francois Lang, Catherine Ball, and Carl Weir. 1989. The PUNDIT natural language processing system. In Proceedings of the annual artificial intelligence systems in government conference, held in Washington, D.C., March 1989.

Manning, Christopher. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the thirty-first annual meeting of the ACL, held in Columbus, OH, June 1993, by the Association for Computational Linguistics. 235-42.

Sager, Naomi, ed. 1981. Natural language information processing: A computer grammar of English and its applications. Reading, MA: Addison-Wesley.

Salton, Gerald. 1983. Introduction to modern information retrieval. New York, NY: McGraw-Hill.

Schütze, Hinrich. 1993. Part-of-speech induction from scratch. In Proceedings of the thirty-first annual meeting of the ACL, held in Columbus, OH, June 1993, by the Association for Computational Linguistics. 251-58.

Schütze, Hinrich, and Jan Pedersen. 1993. A vector model for syntagmatic and paradigmatic relatedness. In Proceedings of the ninth annual conference of the Centre for the New OED and Text Research, held in Oxford, England, 1993, by the University of Waterloo.

Wilms, G. Jan. 1995. Automated induction of a lexical sublanguage grammar using a hybrid system of corpus- and knowledge-based techniques. Ph.D. diss., Mississippi State University.