The Fitted Parse - ACL Anthology Reference Corpus

The Fitted Parse:
100% Parsing Capability in a Syntactic Grammar of English
Karen Jensen and George E. Heidorn
Computer Sciences Department
IBM Thomas J. Watson Research Center
Yorktown Heights, New York 10598
Abstract

A technique is described for performing fitted parsing. After the rules of a more conventional syntactic grammar are unable to produce a parse for an input string, this technique can be used to produce a reasonable approximate parse that can serve as input to the remaining stages of processing. The paper describes how fitted parsing is done in the EPISTLE system and discusses how it can help in dealing with many difficult problems of natural language analysis.

Introduction

The EPISTLE project has as its long-range goal the machine processing of natural language text in an office environment. Ultimately we intend to have software that will be able to parse and understand ordinary prose documents (such as those that an office principal might expect his secretary to cope with), and will be able to generate at least a first draft of a business letter or memo. Our current goal is a system for critiquing written material on points of grammar and style.

Our grammar is written in NLP (Heidorn 1972), an augmented phrase structure language which is implemented in LISP/370. The EPISTLE grammar currently uses syntactic, but not semantic, information. Access to an on-line standard dictionary with about 130,000 entries, including part-of-speech and some other syntactic information (such as transitivity of verbs), makes the system's vocabulary essentially unlimited. We test and improve the grammar by regularly running it on a data base of 2254 sentences from 411 actual business letters. Most of these sentences are rather complicated; the longest contains 63 words, and the average length is 19.2 words.

Since the subset of English which is represented in business documents is very large, we need a very comprehensive grammar and robust parser. In the course of this work we have developed some new techniques to help deal with the refractory nature of natural language syntax. In this paper we discuss one such technique: the fitted parse, which guarantees the production of a reasonable parse tree for any string, no matter how unorthodox that string may be. The parse which is produced by fitting might not be perfect; but it will always be reasonable and useful, and will allow for later refinement by semantic processing.

There is a certain perception of parsing that leads to the development of techniques like this one: namely, that trying to write a grammar to describe explicitly all and only the sentences of a natural language is about as practical as trying to find the Holy Grail. Not only will the effort expended be Herculean, it will be doomed to failure. Instead we take a heuristic approach and consider that a natural language parser can be divided into three parts:

(a) a set of rules, called the core grammar, that precisely define the central, agreed-upon grammatical structures of a language;
(b) peripheral procedures that handle parsing ambiguity: when the core grammar produces more than one parse, these procedures decide which of the multiple parses is to be preferred;
(c) peripheral procedures that handle parsing failure: when the core grammar cannot define an acceptable parse, these procedures assign some reasonable structure to the input.

In EPISTLE, (a) the core grammar consists at present of a set of about 300 syntax rules; (b) ambiguity is resolved by using a metric that ranks alternative parses (Heidorn 1982); and (c) parse failure is handled by the fitting procedure described here.

In using the terms core grammar and periphery we are consciously echoing recent work in generative grammar, but we are applying the terms in a somewhat different way. Core grammar, in current linguistic theory, suggests the notion of a set of very general rules which define universal properties of human language and effectively set limits on the types of grammars that any particular language may have; periphery phenomena are those constructions which are peculiar to particular languages and which require added rules beyond what the core grammar will provide (Lasnik and Freidin 1981). Our current work is not concerned with the meta-rules of a Universal Grammar. But we have found that a distinction between core and periphery is useful even within a grammar of a particular language -- in this case, English.

This paper first reviews parsing in EPISTLE, and then describes the fitting procedure, followed by several examples of its application. Then the benefits of parse fitting and the results of using it in our system are discussed, followed by its relation to other work.
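
The three-part division of a parser set out in this Introduction (core grammar, ambiguity resolution, failure handling) can be pictured as a simple control flow. The sketch below is only an illustration in Python, not EPISTLE's NLP code: the names `analyze`, `core_parse`, `metric` and `fit` are hypothetical, and it assumes that a lower metric value marks the preferred parse.

```python
from typing import Callable, List, TypeVar

P = TypeVar("P")  # stand-in for whatever structure represents a parse


def analyze(
    text: str,
    core_parse: Callable[[str], List[P]],  # (a) hypothetical core grammar
    metric: Callable[[P], float],          # (b) hypothetical ranking metric
    fit: Callable[[str], P],               # (c) hypothetical fitting procedure
) -> P:
    """Return exactly one parse for `text`, whatever the core grammar does."""
    parses = core_parse(text)
    if not parses:
        # Parse failure: fall back on the peripheral fitting procedure.
        return fit(text)
    # One or more parses: keep the one with the best metric value.
    # Assumption: a lower metric value means a more preferred parse.
    return min(parses, key=metric)
```

Under these assumptions every input string yields a single structure for later processing, which is the sense in which the approach offers 100% parsing capability.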
Parsing in EPISTLE

EPISTLE's parser is written in the NLP programming language, which works with augmented phrase structure rules and with attribute-value records, which are manipulated by the rules. When NLP is used to parse natural language text, the records describe constituents, and the rules put these constituents together to form ever larger constituent (or record) structures. Records contain all the computational and linguistic information associated with words, with larger constituents, and with the parse formation. At this time our grammar is sentence-based; we do not, for instance, create record structures to describe paragraphs. Details of the EPISTLE system and of its core grammar may be found in Miller et al., 1981, and Heidorn et al., 1982.

A close examination of parse trees produced by the core grammar will often reveal branch attachments that are not quite right: for example, semantically incongruous prepositional phrase attachments. In line with our pragmatic parsing philosophy, our core grammar is designed to produce unique approximate parses. (Recall that we currently have access only to syntactic and morphological information about constituents.) In the cases where semantic or pragmatic information is needed before a proper attachment can be made, rather than produce a confusion of multiple parses we force the grammar to try to assign a single parse. This is usually done by forcing some attachments to be made to the closest, or rightmost, available constituent. This strategy only rarely impedes the type of grammar-checking and style-checking that we are working on. And we feel that a single parse with a consistent attachment scheme will yield much more easily to later semantic processing than would a large number of different structures.

The rules of the core grammar (CG) produce single approximate parses for the largest percentage of input text. The CG can always be improved and its coverage extended; work on improving the EPISTLE CG is continual. But the coverage of a core grammar will never reach 100%. Natural language is an organic symbol system; it does not submit to cast-iron control. For those strings that cannot be fully parsed by rules of the core grammar we use a heuristic best fit procedure that produces a reasonable parse structure.

The Fitting Procedure

The fitting procedure begins after the CG rules have been applied in a bottom-up, parallel fashion, but have failed to produce an S node that covers the string. At this point, as a by-product of bottom-up parsing, records are available for inspection that describe the various segments of the input string from many perspectives, according to the rules that have been applied. The term fitting has to do with selecting and fitting these pieces of the analysis together in a reasonable fashion.

The algorithm proceeds in two main stages: first, a head constituent is chosen; next, remaining constituents are fitted in. In our current implementation, candidates for the head are tested preferentially as follows, from most to least desirable:

(a) VPs with tense and subject;
(b) VPs with tense but no subject;
(c) segments other than VP;
(d) untensed VPs.

If more than one candidate is found in any category, the one preferred is the widest (covering most text). If there is a tie for widest, the leftmost of those is preferred. If there is a tie for leftmost, the one with the best value for the parse metric is chosen. If there is still a tie (a very unlikely case), an arbitrary choice is made. (Note that we consider a VP to be any segment of text that has a verb as its head element.)

The fitting process is complete if the head constituent covers the entire input string (as would be the case if the string contained just a noun phrase, for example, "Salutations and congratulations"). If the head constituent does not cover the entire string, remaining constituents are added on either side, with the following order of preference:

(a) segments other than VP;
(b) untensed VPs;
(c) tensed VPs.

As with the choice of head, the widest candidate is preferred at each step. The fit moves outward from the head, both leftward to the beginning of the string, and rightward to the end, until the entire input string has been fitted into a best approximate parse tree. The overall effect of the fitting process is to select the largest chunk of sentence-like material within a text string and consider it to be central, with left-over chunks of text attached in some reasonable manner.

As a simple example, consider this text string which appeared in one of our EPISTLE data base letters:

    "Example: 75 percent of $250.00 is $187.50."

Because this string has a capitalized first word and a period at its end, it is submitted to the core grammar for consideration as a sentence. But it is not a sentence, and so the CG will fail to arrive at a completed parse. However, during processing, the CG will have assigned many structures to its many substrings. Looking for a head constituent among these structures, the fitting procedure will first seek VPs with tense and subject. Several are present: "$250.00 is", "percent of $250.00 is", "$250.00 is $187.50", and so on. The widest and leftmost of these VP constituents is the one which covers the string "75 percent of $250.00 is $187.50", so it will be chosen as head.

The fitting process then looks for additional constituents to the left, favoring ones other than VP. It finds first the colon, and then the word "Example". In this string the only constituent following the head is the final period, which is duly added. The complete fitted parse is shown in Figure 1.

The form of parse tree used here shows the top-down structure of the string from left to right, with the terminal nodes being the last item on each line. At each level of the tree (in a vertical column), the head element of a constituent is marked with an asterisk. The other elements above and below are pre- and post-modifiers. The highest element of the trees shown here is FITTED, rather than the more usual SENT. (It is important to remember that these parse diagrams are only shorthand representations for the NLP record structures, which contain an abundance of information about the string processed.)

The tree of Figure 1, which would be lost if we restricted ourselves to the precise rules of the core grammar, is now available for examination, for grammar and style checking, and ultimately for semantic interpretation. It can take its place in the stream of continuous text and be analyzed for what it is -- a sentence fragment, interpretable only by reference to other sentences in context.
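
The two stages of the fitting procedure, choosing a head and then fitting the remaining constituents outward from it, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the NLP implementation: the `Segment` record and the functions `head_rank`, `filler_rank`, `choose_head` and `fit` are hypothetical names, token positions stand in for NLP's record structures, a lower `metric` value is assumed to be better, and every token is assumed to have at least one record of its own so that the outward walk can always continue.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    label: str                 # e.g. "NP", "VP", "PUNC" (hypothetical labels)
    start: int                 # index of the first token covered
    end: int                   # index one past the last token covered
    tensed: bool = False       # only meaningful for VPs
    has_subject: bool = False  # only meaningful for VPs
    metric: float = 0.0        # parse metric; assumed lower is better

    @property
    def width(self) -> int:
        return self.end - self.start


def head_rank(s: Segment) -> int:
    """Head preference: tensed VP with subject, tensed VP, non-VP, untensed VP."""
    if s.label != "VP":
        return 2
    if s.tensed:
        return 0 if s.has_subject else 1
    return 3


def filler_rank(s: Segment) -> int:
    """Preference for remaining constituents: non-VP, untensed VP, tensed VP."""
    if s.label != "VP":
        return 0
    return 2 if s.tensed else 1


def choose_head(segments: List[Segment]) -> Segment:
    # Ties are broken by widest, then leftmost, then best metric value.
    return min(segments, key=lambda s: (head_rank(s), -s.width, s.start, s.metric))


def fit(segments: List[Segment], n_tokens: int) -> List[Segment]:
    """Return the top-level constituents of the fitted parse, left to right."""
    head = choose_head(segments)
    left: List[Segment] = []
    pos = head.start
    while pos > 0:  # move leftward to the beginning of the string
        cands = [s for s in segments if s.end == pos]
        if not cands:
            raise ValueError(f"no record ends at token {pos}")
        best = min(cands, key=lambda s: (filler_rank(s), -s.width))
        left.insert(0, best)
        pos = best.start
    right: List[Segment] = []
    pos = head.end
    while pos < n_tokens:  # move rightward to the end of the string
        cands = [s for s in segments if s.start == pos]
        if not cands:
            raise ValueError(f"no record starts at token {pos}")
        best = min(cands, key=lambda s: (filler_rank(s), -s.width))
        right.append(best)
        pos = best.end
    return left + [head] + right
```

For the nine tokens of "Example: 75 percent of $250.00 is $187.50." the widest tensed VP with a subject spans tokens 2 to 8, so it becomes the head, and the word "Example", the colon and the final period are attached around it.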
FITTED|---NP------NOUN*---"Example"
      |---":"
      |---VP*|---NP|-----QUANT---NUM*----"75"
      |      |     |-----NOUN*---"percent"
      |      |     |-----PP|-----PREP----"of"
      |      |             |-----MONEY*--"$250.00"
      |      |---VERB*---"is"
      |      |---NP------MONEY*--"$187.50"
      |---"."

Figure 1. An example fitted parse tree.
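
The tree notation used in these figures (terminal nodes last on each line, head elements marked with an asterisk) can be imitated with a small recursive printer. The sketch below is an approximation of the layout only: `Node` and `render` are hypothetical names, and the real diagrams are produced from NLP record structures that carry far more information than this.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                          # "" for bare punctuation terminals
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None          # set only on terminal nodes
    is_head: bool = False               # printed as a trailing asterisk


def render(node: Node, indent: int = 0) -> List[str]:
    """Return one output line per terminal, left-to-right down the page."""
    name = node.label + ("*" if node.is_head else "")
    if node.word is not None:
        cell = f'{name}---"{node.word}"' if node.label else f'"{node.word}"'
        return [" " * indent + cell]
    cell = name + "---"
    lines: List[str] = []
    for i, child in enumerate(node.children):
        sub = render(child, indent + len(cell))
        if i == 0 and sub:
            # The first child shares its line with the parent's label.
            sub[0] = " " * indent + cell + sub[0][indent + len(cell):]
        lines.extend(sub)
    return lines


def leaf(label: str, word: str, head: bool = False) -> Node:
    return Node(label, word=word, is_head=head)


figure1 = Node("FITTED", [
    Node("NP", [leaf("NOUN", "Example", head=True)]),
    leaf("", ":"),
    Node("VP", [
        Node("NP", [
            Node("QUANT", [leaf("NUM", "75", head=True)]),
            leaf("NOUN", "percent", head=True),
            Node("PP", [leaf("PREP", "of"), leaf("MONEY", "$250.00", head=True)]),
        ]),
        leaf("VERB", "is", head=True),
        Node("NP", [leaf("MONEY", "$187.50", head=True)]),
    ], is_head=True),
    leaf("", "."),
])

print("\n".join(render(figure1)))
```

The printer omits the vertical bars of the original diagrams; it is meant only to show how the head marking and the one-terminal-per-line layout follow from the tree structure.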
[Parse tree diagram: FITTED with head NP*, a conjunction of the noun phrases "Good luck" and "good selling".]

Figure 2. Fitted noun phrase (fragment).
[Parse tree diagram: FITTED with head VP* (the main clause, ending in "should be $14,682.61"), followed by ",", AVP ("not"), NP ("$14,682.67") and ".".]

Figure 3. Fitted sentence with ellipsis.
Further Examples

The fitted parse approach can help to deal with many difficult natural language problems, including fragments, difficult cases of ellipsis, proliferation of rules to handle single phenomena, phenomena for which no rule seems adequate, and punctuation horrors. Each of these is discussed here with examples.

Fragments. There are many of these in running text; they are frequently NPs, as in Figure 2, and include common greetings, farewells, and sentiments. (N.b., all examples in this paper are taken from the EPISTLE data base.)

Difficult cases of ellipsis. In the sentence of Figure 3, what we really have at a semantic level is a conjunction of two propositions which, if generated directly, would read: "The Annual Commission Statement total should be $14,682.61; the Annual Commission Statement total should not be $14,682.67." Deletion processes operating on the second proposition are lawful (deletion of identical elements), but massive. It would be unwise to write a core grammar rule that routinely allowed negativized NPs to follow main clauses, because:

(a) the proper analysis of this sentence would be obscured: some pieces -- namely, the inferred concepts -- are missing from the second part of the surface sentence;
(b) the linguistic generalization would be lost: any two conjoined propositions can undergo deletion of identical (recoverable) elements.

A fitted parse such as Figure 3 allows us to inspect the main clause for syntactic and stylistic deviances, and at the same time makes clear the breaking point between the two propositions and opens the door for a later semantic processing of the elided elements.

Proliferation of rules to handle single phenomena. There are some English constructions which, although they have a fairly simple and unitary form, do not hold anything like a unitary ordering relation within clause boundaries. The vocative is one of these:

(a) Bill, I've been asked to clarify the enclosed letter.
[Parse tree diagram: FITTED with NP ("Bill"), ",", and head VP* ("I've been asked to clarify the enclosed letter").]

Figure 4. Fitted sentence with initial vocative.
[Parse tree diagram: FITTED with NP ("Good luck"), PP ("to you and yours"), CONJ ("and"), and head VP* ("I wish you the very best in your future efforts").]

Figure 5. Fitted conjunction of noun phrase with clause.
(b) I've been asked, Bill, to clarify the enclosed letter.
(c) I've been asked to clarify the enclosed letter, Bill.

In longer sentences there would be even more possible places to insert the vocative, of course.

Rules could be written that would explicitly allow the placement of a proper name, surrounded by commas, at different positions in the sentence -- a different rule for each position. But this solution lacks elegance, makes a simple phenomenon seem complicated, and always runs the risk of overlooking yet one more position where some other writer might insert a vocative. The parse fitting procedure provides an alternative that preserves the integrity of the main clause and adds the vocative at a break in the structure, which is where it belongs, as shown in Figure 4. Other similar phenomena, such as parenthetical expressions, can be handled in this same fashion.

Phenomena for which no rule seems adequate. The sentence "Good luck to you and yours and I wish you the very best in your future efforts." is, on the face of it, a conjunction of a noun phrase (or NP plus PP) with a finite verb phrase. Such constructions are not usually considered to be fully grammatical, and a core grammar which contained a rule describing this construction ought probably to be called a faulty grammar. Nevertheless, ordinary English correspondence abounds with strings of this sort, and readers have no difficulty construing them. The fitted parse for this sentence in Figure 5 presents the finite clause as its head and adds the remaining constituents in a reasonable fashion. From this structure later semantic processing could infer that "Good luck to you and yours" really means "I express/send/wish good luck to you and yours" -- a special case of formalized, ritualized ellipsis.

Punctuation horrors. In any large sample of natural language text, there will be many irregularities of punctuation which, although perfectly understandable to readers, can completely disable an explicit computational grammar. In business text these difficulties are frequent. Some can be caught and corrected by punctuation checkers and balancers. But others cannot, sometimes because, for all their trickiness, they are not really wrong. Yet few grammarians would care to dignify, by describing it with rules of the core grammar, a text string like:

    "Options: A1-(Transmitter Clocked by Dataset) B3-(without the 605 Recall Unit) C5-(with ABC Ring Indicator) D8-(without Auto Answer) E10-(Auto Ring Selective)."

Our parse fitting procedure handles this example by building a string of NPs separated with punctuation marks, as shown in Figure 6. This solution at least enables us to get a handle on the contents of the string.
[Parse tree diagram: FITTED dominating a string of NPs (with some PPs) for the "Options:" list, separated by punctuation marks.]

Figure 6. Fitted list.
Benefits

There are two main benefits to be gained from using the fitted parse approach. First, it allows for syntactic processing -- for our purposes, grammar and style checking -- to proceed in the absence of a perfect parse. Second, it provides a promising structure to submit to later semantic processing routines. And parenthetically, a fitted parse diagram is a great aid to rule debugging. The place where the first break occurs between the head constituent and its pre- or post-modifiers usually indicates fairly precisely where the core grammar failed.

It should be emphasized that a fitting procedure cannot be used as a substitute for explicit rules, and that it in no way lessens the importance of the core grammar. There is a tight interaction between the two components. The success of the fitted parse depends on the accuracy and completeness of the core rules; a fit is only as good as its grammar.

Results

In December of 1981, the EPISTLE grammar, which at that time consisted of about 250 grammar rules and did not include the fitted parsing technique, was run on the data base of 2254 sentences from business letters of various types. The input corpus was very raw: it had not been edited for spelling or other typing errors, nor had it been manipulated in any way that might have made parsing easier.

At that time the system failed to parse 832, or 36%, of the input sentences. (It gave single parses for 41%, double parses for 11%, and 3 or more parses for 12%.) Then we added the fitting procedure and also worked to improve the core grammar.

Concentrating only on those 832 sentences which in December failed to parse, we ran the grammar again in July, 1982, on a subset of 163 of them. This time the number of core grammar rules was 300. Where originally the CG could parse none of these 163 sentences, this time it yielded parses (mostly single or double) for 109 of them. The remaining 54 were handled by the fitting procedure.

Close analysis of the 54 fitted parses revealed that 14 of these sentences bypass the core grammar simply because of missing dictionary information: for example, the CG contains a rule to parse ditransitive VPs (indirect object-taking VPs with verbs like "give" or "send"), but that rule will not apply if the verb is not marked as ditransitive. The EPISTLE dictionary will eventually have all ditransitive verbs marked properly, but right now it does not.

Removing those 14 sentences from consideration, we are left with a residue of 40 strings, or about 25% of the 163 sentences, which we expect always to handle by means of the fitted parse. These strings include all of the problem types mentioned above (fragments, ellipsis, etc.), and the fitted parses produced were adequate for our purposes. It is not yet clear how this 25% might extrapolate to business text at large, but it seems safe to say that there will always be a significant percentage of natural business correspondence which we cannot expect to parse with the core grammar, but which responds nicely to peripheral processing techniques like those of the fitted parse. (A more recent run of the entire data base resulted in 27% fitted parses.)

Related Work

Although we know of no approach quite like the one described here, other related work has been done. Most of this work suggests that unparsable or ill-formed input should be handled by relaxation techniques, i.e., by relaxing restrictions in the grammar rules in some principled way. This is undoubtedly a useful strategy -- one which EPISTLE makes use of, in fact, in its rules for detecting grammatical errors (Heidorn et al. 1982). However, it is questionable whether such a strategy can ultimately succeed in the face of the overwhelming (for all practical purposes, infinite) variety of ill-formedness with which we are faced when we set out to parse truly unrestricted natural language input. If all ill-formedness is rule-based (Weischedel and Sondheimer 1981, p. 3), it can only be by some very loose definition of the term rule, such as that which might apply to the fitting algorithm described here.

Thus Weischedel and Black, 1980, suggest three techniques for responding intelligently to unparsable inputs:

(a) using presuppositions to determine user assumptions; this course is not available to a syntactic grammar like EPISTLE's;
(b) using relaxation techniques;
(c) supplying the user with information about the point where the parse blocked; this would require an interactive environment, which would not be possible for every type of natural language processing application.

Kwasny and Sondheimer, 1981, are strong proponents of relaxation techniques, which they use to handle both cases of clearly ungrammatical structures, such as co-occurrence violations like subject/verb disagreement, and cases of perfectly acceptable but difficult constructions (ellipsis and conjunction).

Weischedel and Sondheimer, 1982, describe an improved ellipsis processor. No longer is ellipsis handled with relaxation techniques, but by predicting transformations of previous parsing paths which would allow for the matching of fragments with plausible contexts. This plan would be appropriate as a next step after the fitted parse, but it does not guarantee a parse for all elided inputs.

Hayes and Mouradian, 1981, also use the relaxation method. They achieve flexibility in their parser by relaxing consistency constraints (grammatical restrictions, like Kwasny and Sondheimer's co-occurrence violations) and also by relaxing ordering constraints. However, they are working with a restricted-domain semantic system and their approach, as they admit, "does not embody a solution for flexible parsing of natural language in general" (p. 236).

The work of Wilks is heavily semantic and therefore quite different from EPISTLE, but his general philosophy meshes nicely with the philosophy of the fitted parse: "It is proper to prefer the normal...but it would be absurd...not to accept the abnormal if it is described" (Wilks 1975, p. 267). Wilks' approach to machine translation, which involves doing some amount of the translation on a phrase-by-phrase basis, is relevant here, too. With fitted parsing, it might be possible to get usable translations for strings that cannot be completely parsed with the core grammar by translating each phrase of the fitted parse separately.

Acknowledgements

We would like to thank Lance Miller for his interest and encouragement in this work, and Roy Byrd, Martin Chodorow, Robert Granville and John Sowa for their comments on an earlier version of this paper.

References

Hayes, P.J. and G.V. Mouradian. 1981. "Flexible Parsing" in Am. J. Comp. Ling. 7.4, 232-242.

Heidorn, G.E. 1972. "Natural Language Inputs to a Simulation Programming System." Technical Report NPS55HD72101A. Monterey, Calif.: Naval Postgraduate School.

Heidorn, G.E. 1982. "Experience with an Easily Computed Metric for Ranking Alternative Parses" in Proc. 20th Annual Meeting of the ACL. Toronto, Canada, 82-84.

Heidorn, G.E., K. Jensen, L.A. Miller, R.J. Byrd, and M.S. Chodorow. 1982. "The EPISTLE Text-Critiquing System" in IBM Sys. J. 21.3, 305-326.

Kwasny, S.C. and N.K. Sondheimer. 1981. "Relaxation Techniques for Parsing Ill-Formed Input" in Am. J. Comp. Ling. 7.2, 99-108.

Lasnik, H. and R. Freidin. 1981. "Core Grammar, Case Theory, and Markedness" in Proc. 1979 GLOW Conf. Pisa, Italy.

Miller, L.A., G.E. Heidorn and K. Jensen. 1981. "Text-Critiquing with the EPISTLE System: An Author's Aid to Better Syntax" in AFIPS Conf. Proc., Vol. 50. Arlington, Va., 649-655.

Weischedel, R.M. and J.E. Black. 1980. "Responding Intelligently to Unparsable Inputs" in Am. J. Comp. Ling. 6.2, 97-109.

Weischedel, R.M. and N.K. Sondheimer. 1981. "A Framework for Processing Ill-Formed Input." Research Report. Univ. of Delaware.

Weischedel, R.M. and N.K. Sondheimer. 1982. "An Improved Heuristic for Ellipsis Processing" in Proc. 20th Annual Meeting of the ACL. Toronto, Canada, 85-88.

Wilks, Yorick. 1975. "An Intelligent Analyzer and Understander of English" in Comm. ACM 18.5, 264-274.