
COMPUTATIONAL METHODS IN ENGINEERING AND SCIENCE
EPMESC X, Aug. 21-23, 2006, Sanya, Hainan, China
©2006 Tsinghua University Press & Springer
CSAT: A Chinese Segmentation and Tagging Module Based on the
Interpolated Probabilistic Model
K. S. Leong *, F. Wong, C. W. Tang, M. C. Dong
Faculty of Science and Technology of University of Macau, Av. Padre Tomás Pereira, Taipa, Macau, China
Email: {ma56538, derekfw}@umac.mo
Abstract Chinese is a challenging language for natural language processing. Unlike languages such as English and Portuguese, the first step in Chinese text processing is word identification, because a Chinese sentence contains no delimiters marking word boundaries. Chinese processing also faces many ambiguity problems, such as segmentation ambiguities, unknown words, and part-of-speech ambiguities. In this paper, we present CSAT, an integrated application with Chinese segmentation and tagging abilities, together with the results of experiments with the implemented application; along the way, we discuss and introduce the adoption of an interpolated probabilistic tagging model for Chinese. The segmentation module of CSAT is based on the Chinese words rough segmentation model using the N-shortest-paths method. With this method, the first phase of Chinese processing yields a rather good rough segmentation of a Chinese sentence. In the second phase, the segmented candidate sentences are tagged by the probabilistic tagger. In our experiments, which are based on a one-month news corpus from the People's Daily, the tagging accuracy reaches 95.20%. This application is the first step of Chinese processing and will be used as a preprocessing module in a Chinese-to-Portuguese machine translation system.
Key words: N-shortest-paths method, words rough segmentation, probabilistic tagging, interpolated probabilistic tagging
INTRODUCTION
To process a Chinese sentence, several steps must be carried out: word segmentation, part-of-speech tagging, syntactic analysis, and semantic and pragmatic analysis. In this paper, we focus on the first two steps: word segmentation and part-of-speech tagging.
In Chinese text processing, the first essential step is word segmentation because Chinese, unlike languages such as English and Portuguese, has no explicit word boundaries within a sentence. Therefore, to segment a Chinese sentence and find the word boundaries within it, one needs to look up a dictionary or a knowledge base that keeps statistics about Chinese words. Several methods have been proposed in the literature, such as full segmentation with a word N-gram model, maximum matching, maximum matching with a rule-based approach [2], maximum probability, and blending segmentation with tagging [3, 4]. The problems with these methods are that they may produce a large set of candidate segmented sentences for later processing, may produce overly assertive results, or may leave some word ambiguities unresolved in the first phase. These problems usually greatly reduce the performance of the system or hurt the accuracy of later processing phases.
Since the result of segmentation is the input of the next phase, part-of-speech tagging, higher accuracy in the segmentation phase will surely yield better part-of-speech tagging results. In this paper, we therefore adopt the Chinese words rough segmentation model based on the N-shortest-paths method [1] as our first-phase segmentation strategy. The result of this phase is passed as input to the probabilistic tagging model in the second phase. With this method, we have run several experiments under different configurations and found that the tagging accuracy reaches 95.20%. In the following sections, we discuss this method and the results obtained from the different experiments.
WORDS ROUGH SEGMENTATION MODEL BASED ON N-SHORTEST-PATHS METHOD
As the segmentation methods mentioned above have problems, we seek a better segmentation method that gives good rough segmentation results for the later phases while maintaining or improving the performance of the application. The Chinese words rough segmentation model based on the N-shortest-paths method [1] is such a method.
This method first performs a rough segmentation of a Chinese sentence and produces a certain number of roughly segmented sentences. A number of experiments then determine the best value of N, i.e., the minimum number of roughly segmented sentences that still includes the final correct segmentation. In short, the method is an extension of the shortest-path algorithm, and its basic idea is as follows.
A given Chinese sentence is first split into its atomic characters. We then create a directed graph for the sentence with the atomic characters as vertices (V1, V2, ..., Vn), as in Fig. 1 (V0 represents the starting node).
Figure 1: The directed graph for words rough segmentation
After that, a knowledge base that keeps word-frequency statistics from a Chinese corpus is looked up for the probabilities of the atomic characters and of character combinations (words). Suppose a word w = Vi Vi+1 ... Vj; then a directed edge <Vi-1, Vj> is added from vertex Vi-1 to vertex Vj, and the length Lw of that edge is assigned the probability of w. From the graph, we see that there must be a directed edge between adjacent vertices Vi and Vi+1: if an atomic character Vi does not appear in the training corpus, the length Li of the directed edge <Vi-1, Vi> is assigned a smoothed probability. However, for any multi-character w = Vi Vi+1 ... Vj that does not appear in the training corpus, no directed edge <Vi-1, Vj> is added.
Based on the statistics kept in the knowledge base, the probability of each w is used in this words rough segmentation model; the model is therefore based on a unigram statistics model.
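To make the graph construction concrete, here is a minimal sketch (our own illustration, not the implementation of [1]): it builds the edge list of the segmentation graph from a hypothetical word-frequency table `word_freq` extracted from the training corpus.

```python
# Sketch of the segmentation graph described above, assuming a
# word-frequency table extracted from the training corpus.
from typing import Dict, List, Tuple

def build_graph(sentence: str, word_freq: Dict[str, int],
                max_word_len: int = 4) -> List[List[Tuple[int, int]]]:
    """Return edges[i] = list of (j, count), meaning an edge <V_i, V_j>
    covering the characters sentence[i:j]."""
    n = len(sentence)
    edges: List[List[Tuple[int, int]]] = [[] for _ in range(n)]
    for i in range(n):
        # Adjacent vertices are always connected: a single character gets
        # its corpus count, or 0 (smoothed later) if it is unseen.
        edges[i].append((i + 1, word_freq.get(sentence[i], 0)))
        # Longer spans become edges only if they appear in the corpus.
        for j in range(i + 2, min(n, i + max_word_len) + 1):
            w = sentence[i:j]
            if w in word_freq:
                edges[i].append((j, word_freq[w]))
    return edges
```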
Let W = w1, w2, ..., wm be one of the segmentations of the Chinese sentence C. Then
P(W \mid C) = \frac{P(W)\, P(C \mid W)}{P(C)}    (1)
where P(C) is the probability of the Chinese sentence to be segmented, which is a constant across all segmentations, and P(C | W) = 1 because a segmentation W reconstructs the whole sentence in exactly one way. Therefore, the goal is to obtain the N segmentations with the N largest probabilities P(W). For a word wi, P(wi) is the probability that wi appears in the training corpus. If wi does not appear in the training corpus, add-one smoothing is applied, so P(wi) can be approximated as
P(w_i) \approx \frac{k_i + 1}{\sum_{j=0}^{m} k_j + V}    (2)
where ki is the number of occurrences of wi and V is the number of word types in the training corpus.
In the rough segmentation phase, the context within the sentence is not considered, for simplicity. The words are then independent, and from (1) and (2),
\arg\max_{W} P(W) = \arg\max_{W} \prod_{i=1}^{m} P(w_i) \approx \arg\max_{W} \prod_{i=1}^{m} \frac{k_i + 1}{\sum_{j=0}^{m} k_j + V}    (3)
For convenience, it is preferable to convert this maximization problem into a minimization problem. Then
\arg\max_{W} P(W) = \arg\min_{W} \left[ -\ln \prod_{i=1}^{m} P(w_i) \right] \approx \arg\min_{W} \sum_{i=1}^{m} \left[ \ln\!\left( \sum_{j=0}^{m} k_j + V \right) - \ln(k_i + 1) \right]    (4)
With the directed graph built and the probabilities of the different rough segmentations calculated, the N rough segmentation results with the minimum values in (4) can be obtained.
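One way to realize the search in (4) is to weight each edge with ln(∑ kj + V) − ln(ki + 1) and run a best-first search for the N cheapest paths from V0 to Vn. The sketch below is our own illustration under these assumptions, reusing the hypothetical `edges` structure from the earlier sketch; because all edge costs are positive, the first N complete paths popped from the queue are exactly the N shortest.

```python
import heapq
import math
from typing import List, Tuple

def n_shortest_segmentations(sentence: str,
                             edges: List[List[Tuple[int, int]]],
                             total_tokens: int, vocab_size: int,
                             n_best: int = 10) -> List[Tuple[float, List[str]]]:
    """Return up to n_best segmentations ordered by the cost in Eq. (4)."""
    norm = math.log(total_tokens + vocab_size)  # ln(sum_j k_j + V)
    n = len(sentence)
    results: List[Tuple[float, List[str]]] = []
    # Best-first search state: (accumulated cost, vertex, words so far).
    heap: List[Tuple[float, int, Tuple[str, ...]]] = [(0.0, 0, ())]
    while heap and len(results) < n_best:
        cost, i, words = heapq.heappop(heap)
        if i == n:
            results.append((cost, list(words)))
            continue
        for j, count in edges[i]:
            # Eq. (4): each word contributes ln(sum k_j + V) - ln(k_i + 1),
            # with add-one smoothing covering unseen single characters.
            step = norm - math.log(count + 1)
            heapq.heappush(heap, (cost + step, j, words + (sentence[i:j],)))
    return results
```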
PROBABILISTIC TAGGING MODEL
In this work, we focus on applying the results obtained from the Chinese words rough segmentation model based on the N-shortest-paths method [1] to the second phase of Chinese processing, part-of-speech tagging. For this phase, we apply the well-known probabilistic model, the Hidden Markov Model [7]. To find the most likely part of speech for each word in a sentence, a probabilistic model can be formulated. Let the most likely part-of-speech sequence be T = {t1, t2, ..., tn}, given a particular word sequence W = {w1, w2, ..., wn}. Using Bayes' rule, we have
P(T \mid W) = \frac{P(W \mid T)\, P(T)}{P(W)}    (5)
where P(T) is the prior probability of the tag sequence T, P(W) is the unconditional probability of the word sequence W, and P(W | T) is the conditional probability of W given the tag sequence T. To find the most likely tag sequence T for the word sequence, we need the T that maximizes P(T | W). Because W is the same for all candidate tag sequences, P(W) need not be considered, so (5) can be rewritten as
P(T \mid W) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)\, P(w_i \mid t_i, \ldots, t_1, w_{i-1}, \ldots, w_1)    (6)
Instead of letting wi depend on all previous words and tags, one assumes that wi depends only on ti, and similarly that the current tag ti depends only on the previous tag ti-1, on the assumption that the local context is sufficient. Simplifying (6) to a bigram model, we get
\arg\max_{T} P(T \mid W) \approx \arg\max_{T} \prod_{i} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)    (7)
By (7), the tag sequence T that maximizes P(T | W) is the final result for the word sequence W.
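Equation (7) is the standard bigram HMM decoding objective, which can be solved with the Viterbi algorithm. The following is a minimal log-space sketch (ours, not the authors' code), where `trans` and `emit` are hypothetical probability tables for P(ti | ti-1) and P(wi | ti) estimated from the tagged corpus, and `<s>` is an assumed sentence-start symbol.

```python
import math
from typing import Dict, List, Tuple

def viterbi(words: List[str], tags: List[str],
            trans: Dict[Tuple[str, str], float],
            emit: Dict[Tuple[str, str], float],
            floor: float = 1e-8) -> Tuple[float, List[str]]:
    """Find argmax_T prod_i P(t_i | t_{i-1}) P(w_i | t_i), as in Eq. (7)."""
    def lp(p: float) -> float:
        return math.log(max(p, floor))  # floor guards unseen events

    # delta[t] = best log-probability of any tag path ending in tag t.
    delta = {t: lp(trans.get(("<s>", t), 0.0)) + lp(emit.get((words[0], t), 0.0))
             for t in tags}
    back: List[Dict[str, str]] = []
    for w in words[1:]:
        new_delta: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for t in tags:
            prev, score = max(
                ((p, delta[p] + lp(trans.get((p, t), 0.0))) for p in tags),
                key=lambda x: x[1])
            new_delta[t] = score + lp(emit.get((w, t), 0.0))
            pointers[t] = prev
        delta = new_delta
        back.append(pointers)
    # Backtrace from the best final tag.
    last = max(delta, key=delta.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return delta[last], path[::-1]
```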
EXPERIMENTS
As mentioned above, we aim to pass the rough segmentation results of each Chinese sentence, produced by the method described above, into the tagging module for part-of-speech tagging. To determine which tagged sentence should be the final output for a given Chinese sentence, we use the following criterion. Let S = {W1, W2, ..., WN} be the set of the N best roughly segmented sentences produced in the segmentation phase (the segmented sentences whose values in (4) are the N smallest among all rough segmentations), and let P(T | Wi) (1 <= i <= N) be the probability of the tagged sentence (see the previous section) for each corresponding segmented sentence Wi. The best tagged sentence is then selected by the value of P(T | Wi): the tagged sentence with the largest P(T | Wi) is taken as the final result.
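In code, this selection rule amounts to tagging each of the N rough segmentations and keeping the highest-scoring candidate; a short sketch, reusing the hypothetical helpers from the earlier sketches:

```python
def tag_sentence(sentence, edges, total_tokens, vocab_size,
                 tags, trans, emit, n_best=10):
    """Tag the best of the N rough segmentations, as described above."""
    best = None
    for _, words in n_shortest_segmentations(
            sentence, edges, total_tokens, vocab_size, n_best):
        score, tag_path = viterbi(words, tags, trans, emit)
        # Keep the candidate whose tagged sentence has the largest P(T | W_i).
        if best is None or score > best[0]:
            best = (score, list(zip(words, tag_path)))
    return best[1] if best else []
```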
For the experiments, we first trained our segmentation and tagging modules on the one-month news corpus of the People's Daily [8] and then used them to segment and tag 2,606 test sentences (about 5% of the sentences in the corpus). We carried out two kinds of experiments, described below. For each experiment, the results are divided into two parts. The first part is the recall rate of correct segmentations among the candidates Wi (1 <= i <= N); we compared the accuracy while varying N, the number of best results produced by the segmentation module for a given sentence. The second part is the accuracy of the part-of-speech tagging; we compared the accuracy while varying the number of segmented candidates passed to the tagging module for each sentence.
1. Experiment 1. The tables below give the results of this experiment.
Table 1: Accuracy of the rough segmentation in experiment 1

N     Mis-segmented sentences     Recall rate (%)
1     303                         88.37
5     70                          97.31
10    18                          99.31
15    8                           99.69
20    7                           99.73

(N: the N best segmentation candidates for a given sentence; recall rate = number of correctly segmented sentences / number of total test sentences)
Table 2: Accuracy of the part-of-speech tagging in experiment 1

N     Mis-tagged tokens     Correct rate (%)
1     2214                  90.33
5     1140                  95.02
10    1109                  95.15
15    1104                  95.18
20    1102                  95.19

(N: the number of segmented candidate sentences considered by the tagging module; the candidate with the highest score produced by the tagger is taken as the final result)
In this experiment, because we wanted a baseline to compare against experiment 2 (described in the next subsection), we modified the model of the segmentation module slightly. It is still the words rough segmentation model, but without statistics: the lengths of the directed edges of the corresponding words are all assigned the value 1.
2. Experiment 2. In this experiment, the segmentation is based on the words rough segmentation model described above. The tables below give the results.
Table 3: Accuracy of the rough segmentation in experiment 2

N     Mis-segmented sentences     Recall rate (%)
1     191                         92.67
5     10                          99.62
10    4                           99.85
15    2                           99.92
20    1                           99.96
Table 4: Accuracy of the part-of-speech tagging in experiment 2

N     Mis-tagged tokens     Correct rate (%)
1     1197                  94.77
5     1100                  95.20
10    1100                  95.20
15    1100                  95.20
20    1100                  95.20
3. Discussion of results. The results of experiment 1 show that when the words rough segmentation model is not based on statistics, the percentage of correct segmentations is not high for small values of N, and the accuracy of the tagging module is affected accordingly. Comparing Table 3 with Table 1 (see also Fig. 2), we can see that when the words rough segmentation model is based on statistics, there are fewer mis-segmented sentences than with the non-statistical model, and more correct segmentations are obtained at smaller values of N (compare the same rows of Table 1 and Table 3, or see Fig. 2).
Figure 2: Comparison of the words rough segmentation results (recall rate vs. the N best segmentation candidates, for the models with and without statistics)
This shows that the words rough segmentation model based on statistical information is more effective.
We can also observe that the accuracy of the tagging module increases as the accuracy of the segmentation module increases. This shows that the quality of the results from the earlier phase is very important for the later phases of processing. Fig. 3 compares the tagging results of the two experiments.
Figure 3: Comparison of results of experiments in tagging
TAGGING MODULE BASED ON THE INTERPOLATED PROBABILISTIC MODEL
From the two experiments above, we can see that the accuracy of the tagging module can still be improved, although whether it can be raised further also depends on the results obtained from the previous phase, i.e., segmentation. Suppose we obtain correct results from the segmentation phase, including some words that are only recovered by postprocessing (described in the next section) in the segmentation phase. Such words never appear in the training corpus, i.e., they are unknown words, so if they are tagged by a general probabilistic tagger, they will probably not be tagged correctly. One can observe, however, that such words often carry prefixes and suffixes (for example, many Chinese idioms have 一 as the starting character) that also occur frequently in words already seen in the training corpus. If the knowledge base keeps statistics about the frequency of these prefixes and suffixes within words in the training corpus, together with the tags assigned to them, the tagging module can make use of those statistics when it comes across unknown words, and it would then be able to make a good guess in assigning tags to those unknown words.
To achieve this good guess, the tagging module can adopt the method of interpolating features within a word [6]. We can formulate a similar method for Chinese text processing.
In [6], such a method is formulated for Portuguese: to capture the effect of capitalization (in Chinese, we have prefixes rather than capitalization) or of the suffix of a word on the part of speech assigned to it, word features are interpolated into the word probability. Recall the word probability in equation (7):
P(w_i \mid t_i) = \frac{P(t_i \mid w_i)\, P(w_i)}{P(t_i)}    (8)
The feature probabilities can be interpolated into the word probability as expressed in the following equation:
P_{\mathrm{interp}}(w_i \mid t_i) \approx \frac{P(w_i)}{P(t_i)} \left[ \lambda_1 P(t_i \mid w_i) + \lambda_2 P(t_i \mid \mathrm{pre}(1)_i) + \lambda_3 P(t_i \mid \mathrm{suf}(1)_i) + \lambda_4 P(t_i \mid \mathrm{suf}(2)_i) \right]    (9)
where in pre(n)i, n is the number of prefix characters used within the Chinese word; in suf(n)i, n is the number of suffix characters used within the Chinese word; and the λi (i = 1, ..., 4) are interpolation coefficients with \sum_i \lambda_i = 1 (optimal values can be obtained by experiments during the training phase).
Here, we have made some changes to the formulation in [6] to suit the Chinese case. Then (9) can be further rewritten as:
P_{\mathrm{interp}}(w_i \mid t_i) \approx \frac{\lambda_1 P(t_i \mid w_i)\, P(w_i)}{P(t_i)} + \frac{P(w_i)}{P(t_i)} \left[ \lambda_2 P(t_i \mid \mathrm{pre}(1)_i) + \lambda_3 P(t_i \mid \mathrm{suf}(1)_i) + \lambda_4 P(t_i \mid \mathrm{suf}(2)_i) \right]

P_{\mathrm{interp}}(w_i \mid t_i) \approx \lambda_1 P(w_i \mid t_i) + \frac{P(w_i)}{P(t_i)} \left[ \lambda_2 P(t_i \mid \mathrm{pre}(1)_i) + \lambda_3 P(t_i \mid \mathrm{suf}(1)_i) + \lambda_4 P(t_i \mid \mathrm{suf}(2)_i) \right]    (10)
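A minimal sketch of equation (10) follows; the probability tables are passed in as hypothetical callables, and the λ values shown are placeholders, since the paper obtains the optimal coefficients experimentally during training.

```python
def interp_word_prob(word: str, tag: str,
                     p_tag_given,  # p_tag_given(x, tag) ~ P(tag | x)
                     p_word,       # p_word(w) ~ P(w)
                     p_tag,        # p_tag(t) ~ P(t)
                     lambdas=(0.7, 0.1, 0.1, 0.1)) -> float:
    """Interpolated word probability of Eq. (10)."""
    l1, l2, l3, l4 = lambdas  # placeholder weights; they must sum to 1
    # pre(1), suf(1), suf(2): one prefix character, one and two suffix chars.
    pre1, suf1, suf2 = word[:1], word[-1:], word[-2:]
    ratio = p_word(word) / p_tag(tag)
    return (l1 * p_tag_given(word, tag) * ratio +   # = lambda1 * P(w | t)
            ratio * (l2 * p_tag_given(pre1, tag) +
                     l3 * p_tag_given(suf1, tag) +
                     l4 * p_tag_given(suf2, tag)))
```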
As mentioned above, the interpolated model allows arbitrary features in the context. If it is worthwhile to use more prefix or suffix characters of the word being tagged, more interpolated terms can of course be added to equation (10) as needed.
Following this discussion, we conclude that the tagging module of CSAT can be based on this interpolated model. At this moment, we are improving the tagging module of CSAT along these lines.
FUTURE IMPROVEMENT
From the experimental results above, we can see that the performance of the segmentation module still has room for improvement. When we examined the results of the words rough segmentation, we found that the mis-segmented sentences contain proper nouns such as persons' names and place names, fixed and iterative expressions (words composed of two or four characters with duplication), derived words (words formed from a stem plus an affix or affix-like structure), and quantity expressions such as date and time expressions (e.g., 1月), monetary expressions, percentages, and generic number expressions.
For the cases described above, we can apply some postprocessing, as suggested in [5], after the rough segmentation phase. For fixed and iterative expressions, we can store frequently occurring patterns in the knowledge base; when an unsegmented sentence is given, these patterns are matched against the sentence to obtain a result refined from the rough segmentation. For derived words, we can likewise keep a set of prefixes and suffixes in the knowledge base for matching after the rough segmentation. If number sequences are found, the number characters are grouped to form quantity expressions; if the special prefixes or suffixes are found, the derived words are recognized in the refined segmentation. For persons' names and place names, we may store well-known people's names and place names in the knowledge base for matching in the postprocessing step; for other people's names, we store surnames with different priorities in the knowledge base and apply the algorithm described in [5], which uses an N-gram model along with the surname-priority constraint to make the best guess at Chinese names.
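As one concrete illustration of this postprocessing (our own sketch, covering only the number-grouping case, not the name-recognition algorithm of [5]), maximal runs of number characters in a rough segmentation, optionally followed by a unit character, can be merged into a single quantity token:

```python
# Chinese and Arabic number characters plus common unit suffixes
# (an illustrative subset, not the full inventory used in [5]).
NUM_CHARS = set("0123456789一二三四五六七八九十百千万亿〇两")
UNIT_SUFFIXES = set("年月日时分秒元%")

def merge_numbers(tokens):
    """Group maximal runs of number characters (plus a trailing unit
    character, if present) into one quantity-expression token."""
    merged, buf = [], ""
    for tok in tokens:
        if all(c in NUM_CHARS for c in tok):
            buf += tok                      # extend the current number run
        elif buf and tok in UNIT_SUFFIXES:
            merged.append(buf + tok)        # e.g. "1" + "月" -> "1月"
            buf = ""
        else:
            if buf:
                merged.append(buf)
                buf = ""
            merged.append(tok)
    if buf:
        merged.append(buf)
    return merged
```

For example, under these assumptions, merge_numbers(["共", "一", "百", "二", "十", "人"]) yields ["共", "一百二十", "人"].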
There is also room to improve the tagging process. For example, a rule-based approach can be applied to disambiguate some special words (e.g., mono-syllabic words) depending on the context [4], and interpolated probabilistic tagging [6] based on the features within a Chinese word can be applied to improve the tagging accuracy.
CONCLUSION
In this paper, we have presented experimental results on Chinese text processing in the segmentation and tagging phases. In the segmentation phase, we ran experiments based on the Chinese words rough segmentation model using the N-shortest-paths method [1] with the different settings described above. The results show that words rough segmentation based on statistics is more effective and that the quality of the results from the segmentation phase is very important for the later processing phases. We then analyzed a tagging model based on feature interpolation [6] and proposed adopting it for Chinese text processing. Finally, we highlighted some techniques from [4, 5] that can be adopted to improve both the segmentation and the part-of-speech tagging. Development and improvement of the CSAT introduced in this paper are still under way in order to obtain satisfactory results. CSAT will be used as a preprocessing module in a Chinese-to-Portuguese machine translation system.
Acknowledgements
The research work reported in this paper was partially supported by "Fundo para o Desenvolvimento das Ciências e da
Tecnologia" under grant 041/2005/A.
REFERENCES
1. Zhang HP, Liu Q. Model of Chinese words rough segmentation based on N-shortest-paths method. Journal of
Chinese Information Processing, 2002; 16(5): 1-7.
2. Tsai CH. MMSEG: A word identification system for Mandarin Chinese text based on two variants of the maximum matching algorithm. Unpublished manuscript, University of Illinois at Urbana-Champaign, USA, 1996.
3. Sun MS, Xu DL, Benjamin KT. Integrated Chinese word segmentation and part-of-speech tagging based on the
divide-and-conquer strategy. Proceedings of the IEEE-NLPKE, Beijing, China, 2003, pp. 610-615.
4. Zhou Q, Yu SW. Blending segmentation with tagging in Chinese language corpus processing. Proceedings of COLING-94, August 1994.
5. Jin WY. NMSU Chinese segmenter. The 1st Chinese Language Processing Workshop, Philadelphia, USA, 1998.
6. Wong F, Chao S, Hu DC, Yu HM. Interpolated probabilistic tagging model optimized with genetic algorithm. Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, China, 2004, pp. 2569-2574.
7. Daniel J, James HM. Speech and Language Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition. Prentice Hall, 2000, Chapter 7.
8. One-month news corpus of the People’s Daily. http://www.icl.pku.edu.cn/icl_res/ (accessed April 2006)