
Information Processing Society of Japan, Research Report
IPSJ SIG Technical Report
2004−NL−160 (3)
2004/3/4
Abstraction of Pattern Rule Set for detecting Noun Phrase
based on Linguistic Hypothesis
Ekaterina Goriouchkina†, Akira Adachi†† and Takenori Makino†
Precise noun phrase detection is a significant technology in syntactic analysis. This paper presents a simple method to refine a pattern rule set extracted from a large corpus. First, the pattern rule set is abstracted into regular expressions using a simple hypothesis. Then, a refining algorithm removes ineffective rules from the pattern rule set. In an experiment on an evaluation corpus, the accuracy of noun phrase detection reaches around 78% per sentence.
Keywords: refining algorithm, noun phrase detection, automaton,
corpus, abstraction
† Department of Information Sciences, Toho University
†† Division of Media Solution, Fujitsu Co. Limited
1. Introduction
Precise noun phrase detection is a significant technology in natural language processing. Stochastic approaches have often been used to achieve it. The supertagger2) detects noun phrases with a stochastic learning method applied to their syntactic structures in the Penn Treebank English corpus3). Chunking4) also uses the Treebank to learn stochastic rules with a Support Vector Machine. Although such stochastic approaches bring high precision, it is difficult to improve their performance further because they rely on stochastic measures.
This paper presents an improved version of the refining algorithm proposed in the previous paper. The refining algorithm removes ineffective pattern rules from a rule set extracted from a tagged corpus. Although the refining algorithm achieved precise noun phrase detection, the results suggested that abstraction of the rule set should be introduced for further improvement. This approach assumes a reasonable hypothesis for abstracting the pattern rules. The pattern rules abstracted under the hypothesis are then refined with the refining algorithm.
The results show that the accuracy of noun phrase detection on an evaluation corpus is around 78% per sentence. Although the corpus is tagged with POS, this simple method is useful for realizing a noun phrase detection system.
First, a search algorithm for noun phrases is described, and the refining algorithm is introduced briefly. Then the abstraction strategy is described, and the experimental results are discussed.
2. Noun Phrase
Definitions of the noun phrase vary. For the sentence “Time flies like an arrow.”, the phrase “like an arrow” is a noun phrase in Phrase structure grammar. The definition in this paper differs from that. A noun phrase is defined as a phrase that does not include verbs or adverbs. It has a noun as its phrasal head and may contain nouns, adjectives, and prepositions between nouns. The noun phrase is regarded as an element modified by a verbal phrase, which has a verb as its phrasal head. In the above sentence, the noun phrases are “Time” and “an arrow”, and “arrow” is a phrasal head.
In a noun phrase, little information is available about the modification relations between words, such as case relations, because there is no verb. For example, “John’s books and pens” and “Taro to Hanako no Musume” (in Japanese) are ambiguous with respect to modification. It is difficult to resolve this ambiguity with syntactic rules, or even with semantic information. Humans, however, can understand such ambiguous noun phrases.
The syntactic structure derived by rules is often unclear within a noun phrase, where the word pattern is essential. Abandoning syntactic structure analysis inside noun phrases contributes to precise syntactic analysis of the sentence. This is the reason for detecting noun phrases with pattern rules rather than with syntactic rules.
Fortunately, there exists a large corpus based on the same definition of the noun phrase. The Wall Street Journal (WSJ) portion of the Penn Treebank3) is used as the corpus. It contains sufficient information to verify the detection algorithm, although its definition of the noun phrase and its POS set are not exactly suited to this objective. This paper aims to propose a practical and simple method to identify noun phrases with pattern rules extracted from a tagged corpus.
Table 1 shows how a sentence is described in the WSJ corpus. Noun phrases are denoted by brackets, and each word is tagged with its POS.
Table 1 Example of WSJ corpus
[ J.P./NNP Bolduc/NNP ]
,/,
[ vice/NN chairman/NN ]
of/IN
[ W.R./NNP Grace/NNP ]
&/CC
[ Co./NNP ]
,/,
[ which/WDT ]
holds/VBZ
[ a/DT 83.4/CD %/NN interest/NN ]
in/IN
[ this/DT energy-services/JJ company/NN ]
,/, was/VBD elected/VBN
[ a/DT director/NN ]
./.
In the table, POS tags are written in capital letters. The initial letter N denotes a noun, V a verb, W a wh-word, and J an adjective; DT denotes a determiner and IN a preposition. The POS set consists of 36 tags. The WSJ corpus consists of more than 70,000 sentences.
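As an illustration of how the bracketed format of Table 1 could be read, the following minimal Python sketch converts a tagged sentence into a POS sequence, a list of noun phrase spans, and the raw POS patterns of those noun phrases. The function name `parse_tagged_sentence` and the span representation (token indices, end exclusive) are assumptions made for this example and do not come from the original system.

```python
def parse_tagged_sentence(text):
    """Return (tags, np_spans) for a bracketed, POS-tagged sentence.

    tags     : list of POS tags, one per token
    np_spans : list of (start, end) token index pairs, end exclusive
    """
    tags, np_spans, start = [], [], None
    for item in text.split():
        if item == "[":
            start = len(tags)            # a noun phrase opens here
        elif item == "]":
            np_spans.append((start, len(tags)))
            start = None
        else:
            # rpartition splits on the last slash, so ",/," yields the tag ","
            _word, _, tag = item.rpartition("/")
            tags.append(tag)
    return tags, np_spans


sample = """
[ J.P./NNP Bolduc/NNP ]
,/,
[ vice/NN chairman/NN ]
of/IN
[ W.R./NNP Grace/NNP ]
"""
tags, spans = parse_tagged_sentence(sample)
# Raw pattern rules: the POS sequence of every bracketed noun phrase.
rules = {tuple(tags[s:e]) for s, e in spans}
print(tags)   # ['NNP', 'NNP', ',', 'NN', 'NN', 'IN', 'NNP', 'NNP']
print(spans)  # [(0, 2), (3, 5), (6, 8)]
```

Collecting the POS sequences inside the brackets over a whole corpus gives an initial, unabstracted pattern rule set of the kind the later sections start from.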
3. Searching Noun Phrase with Finite Automaton
The abstracted pattern rules are written as a regular grammar with three operators: concatenation, selection (|), and power (*). A finite automaton5) is used to search for noun phrases. The automaton simply performs a longest match against the pattern rules, so that when searching for a noun phrase, preference is given to the longest pattern rule that matches the input sequence.
The longest match algorithm has a certain limitation in detecting noun phrases, as is well known from Kana-Kanji conversion, but it is a simple mechanism that makes it possible to use a large training corpus. This limitation imposes an upper bound on the accuracy of the noun phrase search, though the bound is not evaluated in the experiment.
In this implementation, the finite automaton has some constraints that keep the program simple. The selection and power operators cannot span multiple states; that is, a single state of the finite automaton performs both the selection of POSs and the power (cycle). Figure 1 shows a simplified example of the finite automaton.
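The longest-match search described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: a rule is represented as a list of states, each state being a pair (set of POS tags, repeat flag), which mirrors the constraint that one state carries both the selection and the power; the power operator is assumed to mean "one or more" occurrences; and the names `match_ends`, `find_noun_phrases`, `RULES`, and `TAGS` are hypothetical.

```python
def match_ends(rule, tags, start):
    """All positions at which `rule` can end when it starts at `start`."""
    positions = {start}
    for tag_set, repeat in rule:
        reached = set()
        for p in positions:
            q = p
            # Consume one tag, or several when the state has the power operator.
            while q < len(tags) and tags[q] in tag_set:
                q += 1
                reached.add(q)
                if not repeat:
                    break
        positions = reached          # empty set means the rule has failed
    return positions


def find_noun_phrases(rules, tags):
    """Scan left to right, always taking the longest match over all rules."""
    spans, i = [], 0
    while i < len(tags):
        best = max((max(match_ends(r, tags, i), default=i) for r in rules),
                   default=i)
        if best > i:
            spans.append((i, best))  # noun phrase covers tags[i:best]
            i = best
        else:
            i += 1                   # no rule matches here; skip this token
    return spans


# Hypothetical rule set: DT JJ (NN|NNS)*, DT (NN|NNS)*, (NN|NNS)*, (NNP|NNPS)*
RULES = [
    [({"DT"}, False), ({"JJ"}, False), ({"NN", "NNS"}, True)],
    [({"DT"}, False), ({"NN", "NNS"}, True)],
    [({"NN", "NNS"}, True)],
    [({"NNP", "NNPS"}, True)],
]
TAGS = ["NNP", "NNP", ",", "NN", "NN", "IN", "DT", "JJ", "NN", "VBD"]
print(find_noun_phrases(RULES, TAGS))  # [(0, 2), (3, 5), (6, 9)]
```

Tracking the set of reachable end positions avoids any dependence on backtracking order, so the longest match is found even when adjacent states accept the same tag.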
Fig.1 An example of finite automaton
4. Refining Algorithm
The refining algorithm reported in the previous paper is introduced briefly. When all pattern rules obtained from the WSJ tagged corpus are used, the precision of noun phrase detection is only 20 to 30% per sentence. The presence of a few ineffective rules obstructs the detection. The refining algorithm aims to discover such ineffective rules.
The algorithm is a simple one, similar to the masking procedure introduced in the Chunking algorithm4). To check the contribution of a rule to the performance, the performance is examined with the rule masked. The Chunking algorithm uses the contribution as data for an SVM. The refining algorithm, however, uses it directly to find the ineffective rules.
To check whether a pattern rule is effective, the refining algorithm removes the pattern rule from the pattern rule set and then measures the accuracy of noun phrase detection on a training corpus. If the accuracy is higher than it was before the rule was removed, the rule is ineffective and is removed from the pattern rule set. This procedure is carried out for all pattern rules.
This procedure has a problem related to the masking order: ineffective pattern rules that have not yet been verified may disturb the verification of other rules. To solve the problem, the refining algorithm has two steps, a forward scan and a backward scan. In the forward scan, each pattern rule is checked and removed in ascending order. After the forward scan, the backward scan is carried out on the remaining pattern rule set in descending order.
Fig.2 Process of the refining algorithm
Figure 2 shows the process of the refining algorithm for a training corpus with 50,000 noun phrases. The horizontal axis is the pattern rule number, and the vertical axis is the accuracy (%) when the rule is removed. As the process advances, the number of pattern rules in the set decreases because of the removals. A drop in accuracy shows that the masked rule is effective; the rule is then returned to the pattern rule set.
Figure 3 shows the accuracy of noun phrase detection on an evaluation corpus with 100,000 noun phrases. The accuracy is measured per sentence, not per phrase. The accuracy improves gradually with the size of the training corpus. When the training corpus size is more than 100,000,
the increase is linear. The result suggests the need to introduce rule abstraction.
Fig.3 Accuracy of noun phrase detection on an evaluation corpus
5. Linguistic Hypothesis for Abstraction
To investigate the effect of rule abstraction, we assume a simple hypothesis that is linguistically reasonable.
Just as A. Mikheev6) treated proper nouns (NNP) and plural proper nouns (NNPS) as a single proper-name class, a selection operator is used for POSs that do not need to be distinguished. In addition, a power operator is used for a POS that is repeated two or more times in a pattern rule.
From an overview of the pattern rule set extracted from a training corpus, the linguistic hypothesis is formulated as follows:
• All nouns (NN, NNP, NNS, NNPS) are merged into a single POS using selection.
• Consecutive occurrences of the same POS in a pattern rule are replaced by a power expression; for example,
CD CD JA NN NN −→ CD* JA NN*
The pattern rules are rewritten as regular expressions according to the hypothesis. In this experiment, the above two criteria are adopted as ones expected to be effective.
6. Training Pattern Rule on Training Corpus
The initial pattern rule set is the set of POS patterns marked as noun phrases in a training corpus. In this experiment, the size of the training corpus ranges from 10,000 to 140,000 noun phrases in steps of 10,000 noun phrases.
First, the initial pattern rule set extracted from the training corpus is rewritten as regular expressions according to the abstraction strategy. Then the resulting pattern rule set is refined by the refining algorithm to remove ineffective rules.
Fig.4 Noun phrase detection accuracy in sentence on the training corpus
Figure 4 shows the noun phrase detection accuracy per sentence on the training corpus for 15 cases of different training corpus size. The dotted line shows the results of the refining algorithm alone. The solid line shows the results of the refining algorithm after abstraction.
Figure 5 shows the number of pattern rules for each training corpus size. The solid line shows the refining algorithm after abstraction, and the dotted line shows the refining algorithm alone. Compared with the refining algorithm alone, the number of pattern rules for a given training corpus size becomes dramatically smaller. The result, however, suggests that the number of new rules still increases with the training corpus size.
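The training procedure, that is, abstraction under the two-part hypothesis followed by the forward and backward scans of the refining algorithm, can be sketched in Python as below. This is a minimal illustration under several assumptions that are not stated in the paper: nouns are merged before consecutive POS are collapsed, the power operator means "one or more", the accuracy per sentence counts a sentence as correct only when all detected spans equal the annotated ones, the scan order is simply the sorted order of the rules, and the four-sentence `CORPUS` is invented toy data. All function names are hypothetical.

```python
NOUN_TAGS = ("NN", "NNP", "NNPS", "NNS")   # merged into one POS by selection


def abstract_rule(pattern):
    """Rewrite a POS sequence as states (tag_tuple, repeat)."""
    rule = []
    for tag in pattern:
        tag_set = NOUN_TAGS if tag in NOUN_TAGS else (tag,)
        if rule and rule[-1][0] == tag_set:
            rule[-1] = (tag_set, True)      # consecutive same POS -> power
        else:
            rule.append((tag_set, False))
    return tuple(rule)


def match_end(rule, tags, start):
    """Longest end position of `rule` from `start` (`start` if it fails)."""
    positions = {start}
    for tag_set, repeat in rule:
        reached = set()
        for p in positions:
            q = p
            while q < len(tags) and tags[q] in tag_set:
                q += 1
                reached.add(q)
                if not repeat:
                    break
        positions = reached
    return max(positions, default=start)


def detect(rules, tags):
    """Longest-match noun phrase detection over a POS sequence."""
    spans, i = [], 0
    while i < len(tags):
        end = max((match_end(r, tags, i) for r in rules), default=i)
        if end > i:
            spans.append((i, end))
            i = end
        else:
            i += 1
    return spans


def sentence_accuracy(rules, corpus):
    """Fraction of sentences whose detected spans equal the annotation."""
    return sum(detect(rules, tags) == gold for tags, gold in corpus) / len(corpus)


def initial_rules(corpus):
    """Abstracted POS patterns of all annotated noun phrases, in sorted order."""
    patterns = {tuple(tags[s:e]) for tags, gold in corpus for s, e in gold}
    return sorted({abstract_rule(p) for p in patterns})


def refine(rules, corpus):
    """Forward then backward scan; drop a rule if masking it raises accuracy."""
    kept = list(rules)
    for backward in (False, True):
        for rule in (kept[::-1] if backward else kept[:]):
            masked = [r for r in kept if r != rule]
            if sentence_accuracy(masked, corpus) > sentence_accuracy(kept, corpus):
                kept = masked               # the rule was ineffective
    return kept


# Toy corpus (invented): POS sequence and annotated noun phrase spans.
CORPUS = [
    (["DT", "NN", "VBZ", "DT", "NN"], [(0, 2), (3, 5)]),
    (["NN", "IN", "NN", "VBD"], [(0, 3)]),
    (["NN", "IN", "NN", "VBD"], [(0, 1), (2, 3)]),
    (["NNS", "IN", "NNP", "VBZ"], [(0, 1), (2, 3)]),
]
initial = initial_rules(CORPUS)
refined = refine(initial, CORPUS)
print(len(initial), sentence_accuracy(initial, CORPUS))  # 3 0.5
print(len(refined), sentence_accuracy(refined, CORPUS))  # 2 0.75
```

In this toy run the rule covering "noun IN noun" is removed because masking it raises the accuracy per sentence, while the other two rules are kept because masking them lowers it.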
Fig.5 The number of pattern rules for the training corpus size
7. Evaluation
The previous section discusses the accuracy on the training corpus. To evaluate the accuracy in a more general case, an evaluation corpus is prepared by selecting a portion of the WSJ corpus different from the training corpus. The evaluation corpus contains 100,000 noun phrases. POS information is tagged for every word in the corpus. In this experiment, the POS sequences in the evaluation corpus are used as input sentences; that is, noun phrase detection is not performed on raw sentences.
Figure 6 shows the accuracy of noun phrase detection on the evaluation corpus for the pattern rule sets obtained from the 15 training corpora. The dotted line shows the refining algorithm alone, and the solid line shows the refining algorithm after abstraction.
Fig.6 The accuracy of noun phrase detection on the evaluation corpus
The accuracy of noun phrase detection with the refining algorithm after abstraction reaches around 78% per sentence. The increase in accuracy levels off beyond 100,000 noun phrases.
With the introduction of abstraction, the accuracy curve becomes smooth and stable. This suggests that the abstraction strategy, though very simple, generalizes the pattern rule set effectively.
8. Discussion
The evaluation shows that practical noun phrase detection can be achieved by the simple method proposed in this paper. The method is not an expensive stochastic learning method. A pattern rule set extracted from a training corpus is rewritten as regular expressions according to the simple linguistic hypothesis, and then the refining algorithm removes ineffective rules from the set.
The accuracy reached around 78% on the evaluation corpus. This score is obtained on a tagged corpus in which every word of every sentence is tagged with POS information. The score will, of course, decrease for raw sentences without tags.
As shown in Figure 5, the number of new pattern rules does not saturate as the training corpus grows. The reason is that the abstraction introduced is not sufficient.
The POS set of the WSJ corpus is not designed for this noun phrase detection. If the POS set is designed carefully for noun phrase detection, the accuracy will improve further. Problems also remain in improving the search algorithm for noun phrases.
To realize practical noun phrase detection systems, the following further research is required to achieve high performance:
• selecting a suitable set of POS for noun phrase detection
• improving the search algorithm for noun phrases by combining it with one for verb phrases to resolve POS ambiguity
Identifying noun phrases precisely will further increase the accuracy of syntactic analysis.
9. Conclusions
This paper proposes a simple non-stochastic method for detecting noun phrases using pattern rules extracted from a large-scale tagged corpus. A refining algorithm removes rules that are ineffective for noun phrase detection by finite automaton. The results suggest the need for rule abstraction. For the abstraction, a simple, linguistically reasonable hypothesis is introduced.
There are many candidates for the hypothesis. In this experiment, the hypothesis uses the following two strategies: all kinds of nouns are merged into a single noun using a selection operator, and consecutive occurrences of the same POS are replaced by the power of the POS.
The refining algorithm refines the rule set after the rule set has been rewritten according to the hypothesis. As a result, the noun phrase detection accuracy improves, reaching around 78% per sentence on an evaluation corpus. This result shows that the combination of abstraction and the refining algorithm can be used to construct a noun phrase detection system with high accuracy. The algorithm is so simple that it can be applied to a large training corpus.
High-speed noun phrase detection with high accuracy will provide a basis for high-performance natural language processing systems in the future.
References
1) Ekaterina Goriouchkina and Akira Adachi. “Refining algorithm of pattern rule set for detecting noun phrases in English”, Information Processing of Japan, NL140-23, 2004.
2) Aravind K. Joshi and Srinivas Bangalore. “Supertagging: An Approach to Almost Parsing”, Computational Linguistics, Volume 25, Number 2, pages 237-264, June 1999.
3) Beatrice Santorini. “Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd revision, 2nd printing)”, Linguistic Data Consortium, June 1990.
4) Taku Kudoh and Yuji Matsumoto. “Chunking with Support Vector Machines”, Information Processing of Japan, NL 140, pages 9-16, 2000.
5) Rafael C. Carrasco and Mikel L. Forcada. “Incremental Construction and Maintenance of Minimal Finite-State Automata”, Computational Linguistics, Volume 28, Number 2, pages 206-217, June 2002.
6) Andrei Mikheev. “Periods, Capitalized Words, etc.”, Computational Linguistics, Volume 28, Number 3, pages 289-318, 2002.