Information Processing Society of Japan, Research Report
IPSJ SIG Technical Report 2004-NL-160 (3), 2004/3/4

Abstraction of Pattern Rule Set for detecting Noun Phrase based on Linguistic Hypothesis

Ekaterina Goriouchkina †, Akira Adachi and Takenori Makino † ††

† Department of Information Sciences, Toho University
†† Division of Media Solution, Fujitsu Co. Limited

Precise noun phrase detection is a significant technology in syntactic analysis. This paper deals with a simple method to refine a pattern rule set extracted from a large corpus. First, the pattern rule set is abstracted into the form of regular expressions according to a simple hypothesis. Then, a refining algorithm removes ineffective rules from the pattern rule set. In an experiment on an evaluation corpus, the accuracy of noun phrase detection reaches around 78% at the sentence level.

Keywords: refining algorithm, noun phrase detection, automaton, corpus, abstraction

1. Introduction

Precise noun phrase detection is a significant technology in natural language processing. Stochastic approaches have often been used to achieve it. Supertagging 2) detects noun phrases with a stochastic learning method applied to their syntactic structures in the Penn Treebank English corpus 3). Chunking 4) also uses the Treebank, learning stochastic rules with Support Vector Machines. Although such stochastic approaches bring high precision, it is difficult to improve their performance further, because they rely on stochastic measurement.

This paper deals with an improved version of the refining algorithm proposed in the previous paper 1). The refining algorithm removes ineffective pattern rules from a rule set extracted from a tagged corpus. Although the refining algorithm achieved precise noun phrase detection, the result suggested that abstraction of the rule set should be introduced for further improvement. The approach assumes a reasonable hypothesis for abstracting the pattern rules. The pattern rules abstracted under the hypothesis are then refined with the refining algorithm. The result shows that the accuracy of noun phrase detection on an evaluation corpus is around 78% at the sentence level. Although the corpus is tagged with POS, this simple method is useful for realizing a noun phrase detection system.

First, a searching algorithm for noun phrases is described, and the refining algorithm is introduced briefly. Then the abstraction strategy is described, and the experimental results are discussed.

2. Noun Phrase

There are various definitions of noun phrase. For the sentence “Time flies like an arrow.”, the phrase “like an arrow” is a noun phrase in Phrase structure grammar. The definition in this paper is different from that. A noun phrase is defined as a phrase that does not include verbs or adverbs. It consists of a noun as phrasal head together with nouns, adjectives, and prepositions between nouns. The noun phrase is regarded as an element modified by a verbal phrase, which has a verb as phrasal head. For the above sentence, the noun phrases are “Time” and “an arrow”, and “arrow” is a phrasal head.

Within a noun phrase, little information about the modification between words, such as case relations, is available, because there is no verb. For example, “John’s books and pens” and “Taro to Hanako no Musume” (in Japanese) are ambiguous with respect to modification. It is difficult to resolve the ambiguity by syntactic rules, or even by semantic information, although humans can understand such ambiguous noun phrases. The syntactic structure derived by rules is often not clear within a noun phrase, and the word pattern is essential there. Abandoning syntactic structure analysis inside noun phrases contributes to precise syntactic analysis of the sentence. This is the reason for realizing noun phrase detection by pattern rules rather than by syntactic rules.

Fortunately, there exists a large corpus based on the same definition of noun phrase. The Wall Street Journal (WSJ) corpus in Penn Treebank 3 is used. It contains sufficient information to verify the detection algorithm, though its definition of noun phrase and its set of POS tags are not exactly fitted to this objective. The paper aims at proposing a usable and simple method to identify noun phrases with pattern rules extracted from a tagged corpus.

Table 1 shows how a sentence is described in the WSJ corpus. Noun phrases are denoted by brackets, and each word is tagged with its POS.

Table 1 Example of WSJ corpus
[ J.P./NNP Bolduc/NNP ] ,/, [ vice/NN chairman/NN ] of/IN [ W.R./NNP Grace/NNP ] &/CC [ Co./NNP ] ,/, [ which/WDT ] holds/VBZ [ a/DT 83.4/CD %/NN interest/NN ] in/IN [ this/DT energy-services/JJ company/NN ] ,/, was/VBD elected/VBN [ a/DT director/NN ] ./.

In the table, POS is annotated in capital letters: an initial N denotes noun, V verb, W wh-word and J adjective, while DT denotes determiner and IN preposition. The POS set has 36 tags. The WSJ corpus consists of more than 70,000 sentences.

3. Searching Noun Phrase with Finite Automaton

An abstracted pattern rule is written in the form of a regular grammar with three operators: concatenation, selection (|) and power (*). To search for noun phrases, a finite automaton 5) is used. This finite automaton simply carries out a longest match against the pattern rules, so that, in searching for a noun phrase, preference is given to the longest pattern rule that matches the input sequence. Although the longest-match algorithm has a certain limitation for detecting noun phrases, as is well known from Kana-Kanji conversion, it is a simple mechanism that makes it possible to use a large corpus for training. The limitation imposes an upper bound on the accuracy of searching noun phrases, though this bound is not evaluated in the experiment.

In this implementation, the finite automaton has some constraints that keep the program simple. The selection and power operators cannot span multiple states. That is, a single state of the finite automaton performs a selection among POS tags and the power (a cycle).
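As an illustration of the longest-match search described above, the following is a minimal Python sketch, not the implementation used in the experiment. It assumes a hypothetical rule representation in which each automaton state is a pair (set of POS tags, repeat flag), so that a single state carries both the selection and the power. The names `State`, `Rule`, `match_len`, `find_noun_phrases` and the two example rules are assumptions made for this sketch; the tag sequence is the fragment “a 83.4 % interest in this energy-services company” from Table 1.

```python
from typing import List, Optional, Sequence, Set, Tuple

# Hypothetical representation: one state = (alternative POS tags, may repeat?).
# The tag set models selection (|); the flag models power (*), zero or more.
State = Tuple[Set[str], bool]
Rule = List[State]


def match_len(rule: Rule, tags: Sequence[str], start: int) -> Optional[int]:
    """Length of the longest match of `rule` at `start`, or None if no match."""

    def go(state_idx: int, pos: int) -> Optional[int]:
        if state_idx == len(rule):
            return pos - start
        allowed, repeat = rule[state_idx]
        if not repeat:
            if pos < len(tags) and tags[pos] in allowed:
                return go(state_idx + 1, pos + 1)
            return None
        # Power state: find how far it can run, then try every run length
        # and keep the longest overall match.
        end = pos
        while end < len(tags) and tags[end] in allowed:
            end += 1
        best: Optional[int] = None
        for stop in range(pos, end + 1):
            length = go(state_idx + 1, stop)
            if length is not None and (best is None or length > best):
                best = length
        return best

    return go(0, start)


def find_noun_phrases(rules: List[Rule], tags: Sequence[str]) -> List[Tuple[int, int]]:
    """Scan left to right; at each position prefer the longest matching rule."""
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(tags):
        best = 0
        for rule in rules:
            length = match_len(rule, tags, i)
            if length is not None and length > best:
                best = length
        if best > 0:
            spans.append((i, i + best))  # half-open span [start, end)
            i += best
        else:
            i += 1
    return spans


NOUN = {"NN", "NNS", "NNP", "NNPS"}
RULES: List[Rule] = [
    [({"DT"}, False), ({"CD"}, True), (NOUN, False), (NOUN, True)],  # DT CD* N N*
    [({"DT"}, False), ({"JJ"}, True), (NOUN, False), (NOUN, True)],  # DT JJ* N N*
]
TAGS = ["DT", "CD", "NN", "NN", "IN", "DT", "JJ", "NN"]
SPANS = find_noun_phrases(RULES, TAGS)
```

For this input, `SPANS` is `[(0, 4), (5, 8)]`, which corresponds to the two bracketed phrases in the Table 1 fragment. The sketch tries every run length of a power state, which is simple but can be slow for long inputs; a compiled automaton would avoid that cost.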
Figure 1 shows a simplified example of the finite automaton.

Fig.1 An example of finite automaton

4. Refining Algorithm

The refining algorithm reported in the previous paper is introduced briefly. When all pattern rules obtained from the WSJ tagged corpus are used, the precision of detecting noun phrases is only 20 to 30% at the sentence level. The presence of a few ineffective rules obstructs the detection. The refining algorithm aims at discovering such ineffective rules.

The algorithm is a simple one, similar to the masking procedure introduced in the Chunking algorithm 4). To check the contribution of a rule to the performance, the performance is examined with the rule masked. The Chunking algorithm uses the contribution as data for an SVM. The refining algorithm, however, uses it directly to find the ineffective rules.

To check whether a pattern rule is effective or not, the refining algorithm removes the pattern rule from the pattern rule set and then measures the accuracy of noun phrase detection on a training corpus. When the accuracy is higher than it was before the rule was removed, the rule is an ineffective one and is removed from the pattern rule set. This procedure is carried out for all pattern rules.

The procedure has a problem related to the masking order: ineffective pattern rules that have not yet been verified may disturb the verification of other rules. To solve the problem, the refining algorithm has two steps, a forward scan and a backward scan. In the forward scan, each pattern rule is checked, and removed if ineffective, in ascending order. After the forward scan, the backward scan is carried out over the remaining pattern rule set in descending order.

Fig.2 Process of the refining algorithm

Figure 2 shows the progress of the refining algorithm for a training corpus of 50,000 noun phrases.
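The masking procedure with its forward and backward scans can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' program: `accuracy` is a caller-supplied function standing in for the sentence-level detection accuracy measured on the training corpus, and `toy_accuracy` together with the three rule strings is invented purely to make the example runnable.

```python
from typing import Callable, List, Sequence

Rule = str  # hypothetical: a pattern rule written as a string of POS tags
Accuracy = Callable[[List[Rule]], float]


def scan(rules: Sequence[Rule], accuracy: Accuracy, backward: bool) -> List[Rule]:
    """One pass over the rule set, masking each rule in turn."""
    kept = list(rules)
    base = accuracy(kept)
    i = len(kept) - 1 if backward else 0
    while 0 <= i < len(kept):
        masked = kept[:i] + kept[i + 1:]
        score = accuracy(masked)
        removed = score > base  # accuracy improved, so the rule is ineffective
        if removed:
            kept, base = masked, score
        if backward:
            i -= 1
        elif not removed:
            i += 1  # after a forward removal the next rule is already at index i
    return kept


def refine(rules: Sequence[Rule], accuracy: Accuracy) -> List[Rule]:
    """Forward scan followed by a backward scan over the remaining rules."""
    forward_result = scan(rules, accuracy, backward=False)
    return scan(forward_result, accuracy, backward=True)


def toy_accuracy(rule_set: List[Rule]) -> float:
    """Invented stand-in for accuracy on a training corpus (assumption)."""
    helpful = {"DT NN", "NNP NNP"}
    harmful = {"VB NN"}
    hits = sum(rule in helpful for rule in rule_set)
    harm = sum(rule in harmful for rule in rule_set)
    return (hits - harm) / len(helpful)


INITIAL_RULES = ["DT NN", "VB NN", "NNP NNP"]
REFINED_RULES = refine(INITIAL_RULES, toy_accuracy)
```

With the toy scoring function, `REFINED_RULES` is `["DT NN", "NNP NNP"]`: masking "VB NN" raises the score, so that rule is dropped, while masking either of the other two lowers it, so they are kept. In a real setting each call to `accuracy` would run the noun phrase search over the whole training corpus, which dominates the cost of refinement.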
In Figure 2, the horizontal axis is the pattern rule number and the vertical axis is the accuracy (%) when that rule is removed. As the procedure progresses, the number of pattern rules in the set decreases because rules are removed. A drop in accuracy shows that the masked rule is effective; such a rule is returned to the pattern rule set.

Figure 3 shows the accuracy of noun phrase detection on an evaluation corpus of 100,000 noun phrases. The accuracy is measured per sentence, not per phrase. The accuracy improves gradually with the size of the training corpus. For training corpus sizes of more than 100,000, the increase is linear. The result suggests the need to introduce rule abstraction.

Fig.3 Accuracy of noun phrase detection on an evaluation corpus

5. Linguistic Hypothesis for Abstraction

To investigate the effect of rule abstraction, we adopt a simple hypothesis that is linguistically reasonable. Just as A. Mikheev 6) treated proper noun (NNP) and plural proper noun (NNPS) as a single proper name, a selection operator is used for POS tags that do not need to be distinguished. In addition, a power operator is used for a POS that is repeated consecutively in a pattern rule. After surveying the pattern rule set extracted from a training corpus, the linguistic hypothesis is formulated as follows:

• All nouns (NN, NNP, NNS, NNPS) are merged into a single POS using selection.
• Consecutive occurrences of the same POS in a pattern rule are replaced by a power expression; for example, CD CD JA NN NN → CD* JA NN*

The pattern rules are rewritten in the form of regular expressions according to the hypothesis. In this study, the two rules above are adopted as criteria that are expected to be effective.

6. Training Pattern Rule on Training Corpus

The initial pattern rule set is the set of POS patterns marked as noun phrases in a training corpus. In this experiment, the size of the training corpus ranges from 10,000 to 140,000 noun phrases in steps of 10,000 noun phrases. First, the initial pattern rule set extracted from the training corpus is rewritten as regular expressions according to the abstraction strategy. Then the resulting pattern rule set is refined by the refining algorithm to remove the ineffective rules in the set.

Figure 4 shows the sentence-level noun phrase detection accuracy on the training corpus for 15 cases of different training corpus size. The dotted line shows the results of the refining algorithm alone. The solid line shows the results of the refining algorithm after the abstraction.

Fig.4 Noun phrase detection accuracy in sentence on the training corpus

Figure 5 shows the number of pattern rules for each training corpus size. The solid line shows the refining algorithm after abstraction, and the dotted line shows the refining algorithm alone. Compared with the refining algorithm alone, the number of pattern rules at each training corpus size becomes dramatically smaller. The result, however, suggests that the number of new rules still increases with the training corpus size.

Fig.5 The number of pattern rules for the training corpus size

7. Evaluation

The previous section discusses the accuracy on the training corpus. To evaluate the accuracy in a more general case, an evaluation corpus is prepared by selecting a portion of the WSJ corpus different from the training corpus. The evaluation corpus contains 100,000 noun phrases. Every word in the corpus is tagged with POS information. In this experiment, the POS patterns in the evaluation corpus are used as input sentences, so noun phrase detection is not performed on raw sentences.

Figure 6 shows the accuracy of noun phrase detection on the evaluation corpus for the pattern rule sets obtained from the 15 training corpora. The dotted line shows the refining algorithm alone, and the solid line shows the refining algorithm after abstraction. The accuracy of noun phrase detection with the refining algorithm after abstraction reaches around 78% at the sentence level. The increase in accuracy becomes stable at more than 100,000 noun phrases.

Fig.6 The accuracy of noun phrase detection on the evaluation corpus

As a result of introducing the abstraction, the accuracy curve becomes smooth and stable. This suggests that the abstraction strategy generalizes the pattern rule set effectively, though it is very simple.

8. Discussion

The evaluation shows that usable noun phrase detection can be achieved by the simple method proposed in this paper. The method is not an expensive stochastic learning method. A pattern rule set extracted from a training corpus is rewritten into regular expressions according to the simple linguistic hypothesis, and then the refining algorithm removes the ineffective rules in the set. The accuracy reached around 78% on the evaluation corpus. This score is obtained on a tagged corpus in which every word of each sentence is tagged with POS information. The score will, of course, be lower for raw sentences without tags.

As shown in Figure 5, the growth in the number of new pattern rules with the training corpus size does not saturate. The reason is that the abstraction introduced is not sufficient. The POS set of the WSJ corpus is not designed for this noun phrase detection. If the POS set is designed carefully for noun phrase detection, the accuracy will improve further. There also remain problems in improving the search algorithm for noun phrases.

In order to realize practical, high-performance noun phrase detection systems, the following further research is required:

• selecting a suitable set of POS for the noun phrase detection
• improving the search algorithm for noun phrases by combining it with one for verb phrases, to resolve POS ambiguity

With precise identification of noun phrases, the accuracy of syntactic analysis will increase further.

9. Conclusions

This paper aims at proposing a simple non-stochastic method for detecting noun phrases using pattern rules extracted from a large-scale tagged corpus. A refining algorithm removes rules that are ineffective for noun phrase detection by finite automaton. Its results suggested the need for rule abstraction. For the abstraction, a simple, linguistically reasonable hypothesis is introduced. There are many candidates for the hypothesis. In this experiment, the hypothesis uses the following two strategies: all kinds of noun are merged into a single noun using a selection operator, and consecutive occurrences of the same POS are replaced by the power of that POS. The refining algorithm refines the rule set after the rule set has been rewritten according to the hypothesis. As a result, the noun phrase detection accuracy is improved. The accuracy reaches around 78% at the sentence level on an evaluation corpus.

This result shows that the abstraction and the refining algorithm can be used to construct a noun phrase detection system with high accuracy. The algorithm is so simple that it can be applied to a large training corpus. Fast and accurate noun phrase detection will provide a basis for high-performance natural language processing systems in the future.

References

1) Ekaterina Goriouchkina and Akira Adachi. “Refining algorithm of pattern rule set for detecting noun phrases in English”, Information Processing of Japan, NL140-23, 2004.
2) Aravind K. Joshi and Srinivas Bangalore. “Supertagging: An Approach to Almost Parsing”, Computational Linguistics, Volume 25, Number 2, pages 237-264, June 1999.
3) Beatrice Santorini. “Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd revision, 2nd printing)”, Linguistic Data Consortium, June 1990.
4) Taku Kudoh and Yuji Matsumoto. “Chunking with Support Vector Machines”, Information Processing of Japan, NL 140, pages 9-16, 2000.
5) Rafael C. Carrasco and Mikel L. Forcada. “Incremental Construction and Maintenance of Minimal Finite-State Automata”, Computational Linguistics, Volume 28, Number 2, pages 206-217, June 2002.
6) Andrei Mikheev. “Periods, Capitalized Words, etc.”, Computational Linguistics, Volume 28, Number 3, pages 289-318, 2002.