Chinese Journal of Electronics
Vol.23, No.1, Jan. 2014
A Part-of-speech Tagging Model Employing
Word Clustering and Syntactic Parsing∗
YUAN Lichi1,2
(1.School of Information Technology, Jiangxi University of Finance and Economics, Nanchang 330013, China)
(2.Jiangxi Key Laboratory of Data and Knowledge Engineering, Jiangxi University of Finance and Economics,
Nanchang 330013, China)
Abstract — Part-of-speech (POS) tagging is a basic task in the field of natural language processing. This paper builds a POS tagger based on an improved Hidden Markov model, employing word clustering and a syntactic parsing model. Firstly, to overcome the defects of the classical HMM, a new statistical model, the Markov family model (MFM), is introduced. Secondly, to address the problem of data sparseness, we propose a bottom-up hierarchical word clustering algorithm. We then combine syntactic parsing with part-of-speech tagging. The part-of-speech tagging experiments show that the improved POS tagging model has higher performance than Hidden Markov models (HMMs) under the same testing conditions: the precision is enhanced from 94.642% to 97.235%.
Key words — Part-of-speech tagging, Hidden Markov
model, Markov family model, Word clustering.
I. Introduction
Tagging words with their correct part-of-speech (singular
proper noun, predeterminer, etc.) is an important precursor to
further automatic natural language processing. Part-of-speech
tagging is used as an early stage of linguistic text analysis
in many applications, including subcategorization acquisition,
text-to-speech synthesis, and corpus indexing. Two prominent, distinct approaches found in previous work are rule-based morphological analysis on the one hand, and stochastic models[1−13] such as Hidden Markov models[1] (HMMs) on the other.
Rule-based morphological analysis relies on hand-crafted rules to decompose input tokens into their morphological components, computing the resulting lexical category as a function of those components. Such systems incorporate the linguistic competence of their human authors, to the extent that such competence can be and is expressed in the systems' rule sets. Unfortunately, constructing a hand-crafted rule set for unrestricted input tokens of a given language is a time-consuming and labor-intensive task. Another common problem for token-wise rule-based approaches is ambiguity: in order to determine which of multiple possible analyses of a single token is the correct one, some reference to the context in which the token occurs is usually required.
The recent top-performing statistics-based approaches to Chinese POS tagging include Hidden Markov models[1], Maximum entropy approaches[2], and Conditional random fields[3]. Stochastic tagging techniques such as Hidden Markov models rely on both lexical and bigram probabilities estimated from a tagged training corpus in order to compute the most likely POS tag sequence for each sequence of input tokens. The existence of hand-tagged training corpora for many languages and the robustness of the resulting models have made stochastic taggers quite popular. Disadvantages of HMM taggers include the large amount of training data required to achieve high levels of accuracy, as well as the fact that no clear allowance is made in traditional HMM tagging architectures for prior linguistic knowledge.
The Hidden Markov models used for tagging rest on three assumptions[1]: (1) limited horizon; (2) time invariance (stationarity); (3) the simplifying assumption that the probability of a word depends only on its own tag. These assumptions (especially the third one) are too crude. In this paper, the Markov family model[14], a new statistical model, is introduced. Under the assumption that the probability of a word depends on both its own tag and the previous word, but that its own tag and the previous word are independent given the word itself, we simplify the Markov family model and apply it to part-of-speech tagging successfully. Experimental results show that this part-of-speech tagging method based on the Markov family model greatly improves the precision compared with the conventional POS tagging method based on the Hidden Markov model under the same testing conditions.
The rest of the paper is organized as follows: Section II introduces the baseline POS tagging model based on Hidden Markov models and proposes a POS tagging method based on an improved Hidden Markov model. A clustering algorithm is then introduced in Section III, which also explores the notion that the performance of an automatically built tagger can be further improved by smoothing when the labeled training corpus is limited. Section IV further improves tagger performance by combining syntactic parsing with part-of-speech tagging.
∗ Manuscript Received Apr. 2012; Accepted May 2013. This work is supported by the National Natural Science Foundation of China (No.61262035); the Science and Technology Foundation of the Education Department of Jiangxi Province, China (No.GJJ12271, No.GJJ12742); and the Natural Science Foundation of Jiangxi Province, China (No.20122BAB201033).
II. The POS Tagging Model Based on Improved HMM

1. The POS tagging model based on HMM
For a tag set T and a finite set of words W, it is customary to define a bigram HMM part-of-speech tagger (T, A, W, B, π), where the probability functions A, B and π are estimated from a tagged training corpus. Under such a model, part-of-speech tags are represented as states of the model, and the task of finding the most likely tag sequence t_{1,n} for an input word sequence w_{1,n} can be formulated as a search for the most likely sequence of HMM states given the observation sequence w_{1,n}: we seek the sequence of tags t_{1,n} = t_1, ..., t_n that maximizes the probability of the tag sequence given the word sequence w_{1,n} = w_1, ..., w_n:

  \arg\max_{t_{1,n}} P(t_{1,n}|w_{1,n}) = \arg\max_{t_{1,n}} \frac{P(w_{1,n}|t_{1,n}) P(t_{1,n})}{P(w_{1,n})} = \arg\max_{t_{1,n}} P(w_{1,n}|t_{1,n}) P(t_{1,n})   (1)

We now reduce this expression to parameters that can be estimated from the training corpus. In addition to the limited-horizon assumption, we make two assumptions about words: words are independent of each other, and a word's identity depends only on its tag. Thus

  P(w_{1,n}|t_{1,n}) P(t_{1,n}) = \prod_{i=1}^{n} P(w_i|t_{1,n}) \times P(t_n|t_{1,n-1}) \times P(t_{n-1}|t_{1,n-2}) \times \cdots \times P(t_2|t_1) = \prod_{i=1}^{n} P(w_i|t_i) \times P(t_n|t_{n-1}) \times P(t_{n-1}|t_{n-2}) \times \cdots \times P(t_2|t_1)

and hence

  \arg\max_{t_{1,n}} P(t_{1,n}|w_{1,n}) = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i|t_i) \times P(t_i|t_{i-1})   (2)

(We define P(t_1|t_0) = 1.0 to simplify our notation.)

2. The POS tagging model based on improved HMM
A major unrealistic assumption of the HMM tagging model is that successive words (observations) are independent and identically distributed within a tag (state). Under the assumption that the probability of a word depends on both its own tag and the previous word, but that its own tag and the previous word are independent given the word itself, the Markov family model has been successfully applied to part-of-speech tagging.

Let S_1 be the finite set of part-of-speech tags and S_2 the finite set of words. Suppose the word sequence {w_i}_{i \ge 1} and the tag sequence {t_i}_{i \ge 1} are Markov chains of a Markov family model; then a word's tag and its previous word are independent given the word itself:

  P(w_{i-1}, t_i|w_i) = P(w_{i-1}|w_i) \cdot P(t_i|w_i)   (3)

For simplicity, also suppose that the word sequence {w_i}_{i \ge 1} and the tag sequence {t_i}_{i \ge 1} are both 2-order Markov chains; thus

  \arg\max_{t_{1,n}} P(t_{1,n}|w_{1,n}) = \arg\max_{t_{1,n}} \frac{P(w_{1,n}|t_{1,n}) P(t_{1,n})}{P(w_{1,n})} = \arg\max_{t_{1,n}} P(w_{1,n}|t_{1,n}) P(t_{1,n})   (4)

where

  P(w_{1,n}|t_{1,n}) P(t_{1,n}) = P(w_{1,n}|t_{1,n}) \times P(t_n|t_{n-1}) \times P(t_{n-1}|t_{n-2}) \times \cdots \times P(t_2|t_1) \times P(t_1)   (5)

According to the properties of the Markov family model, we have

  P(w_{1,n}|t_{1,n}) = P(w_n|w_1, \cdots, w_{n-1}, t_1, \cdots, t_{n-1}, t_n) \cdot P(w_{1,n-1}|t_{1,n-1}) = P(w_n|w_{n-1}, t_n) \cdot P(w_{1,n-1}|t_{1,n-1})   (6)

From Eq.(3), we can get

  P(w_n|w_{n-1}, t_n) = \frac{P(w_{n-1}, t_n|w_n) \cdot P(w_n)}{P(w_{n-1}, t_n)} = \frac{P(t_n|w_n) \cdot P(w_{n-1}|w_n) \cdot P(w_n)}{P(t_n|w_{n-1}) \cdot P(w_{n-1})} = \frac{P(t_n|w_n) \cdot P(w_n|w_{n-1})}{P(t_n|w_{n-1})}   (7)

So, since the factors P(w_i|w_{i-1}) do not depend on the tags and can be dropped from the maximization,

  \arg\max_{t_{1,n}} P(t_{1,n}|w_{1,n}) = \arg\max_{t_{1,n}} P(w_1|t_1) \cdot P(t_1) \prod_{i=2}^{n} \frac{P(t_i|w_i) \cdot P(t_i|t_{i-1})}{P(t_i|w_{i-1})}   (8)

Once we have a probabilistic model, the next challenge is to find an effective algorithm for computing the maximum-probability tag sequence for a given input. The Viterbi algorithm[1] is a dynamic programming method which, for a given word sequence w_1, \cdots, w_n, efficiently computes the most likely tag sequence t_1, \cdots, t_n according to the model parameters. The computation proceeds as follows (see Fig.1).

comment: Given: a sentence of length n; the number of tags in the tag set is T.
comment: Initialization
  δ_1(t_j) = P(w_1|t_j) \cdot P(t_j), 1 ≤ j ≤ T
  Ψ_1(t_j) = 0, 1 ≤ j ≤ T
comment: Induction
  for i := 1 to n − 1 step 1 do
    for all tags t_j do
      δ_{i+1}(t_j) = max_{1≤k≤T} [δ_i(t_k) × P(t_j|w_{i+1}) × P(t_j|t_k) / P(t_j|w_i)]
      Ψ_{i+1}(t_j) = argmax_{1≤k≤T} [δ_i(t_k) × P(t_j|w_{i+1}) × P(t_j|t_k) / P(t_j|w_i)]
    end
  end
comment: Termination and path read-out; X_1, \cdots, X_n are the tags chosen for words w_1, \cdots, w_n
  X_n = argmax_{1≤j≤T} δ_n(t_j)
  for j := n − 1 to 1 step −1 do
    X_j = Ψ_{j+1}(X_{j+1})
  end
  P(X_1, \cdots, X_n) = max_{1≤j≤T} δ_n(t_j)

Fig. 1. Algorithm for tagging
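To make the decoding of Fig.1 concrete, the following is a minimal Python sketch of the Viterbi recursion for the improved model, assuming the probabilities P(w|t), P(t), P(t|t') and P(t|w) have already been estimated from a tagged corpus and are supplied as plain dictionaries; the function and parameter names are illustrative, not part of the original system.

```python
def viterbi_mfm(words, tags, p_w_given_t, p_t, p_t_given_t, p_t_given_w):
    """Decode a tag sequence with the improved-HMM (Markov family) recursion:
    delta_{i+1}(t_j) = max_k delta_i(t_k) * P(t_j|w_{i+1}) * P(t_j|t_k) / P(t_j|w_i).
    Probabilities are plain dictionaries; missing entries fall back to a small
    floor value so the sketch never divides by zero.
    """
    floor = 1e-9
    n = len(words)
    # Initialization: delta_1(t_j) = P(w_1|t_j) * P(t_j)
    delta = [{t: p_w_given_t.get((words[0], t), floor) * p_t.get(t, floor) for t in tags}]
    back = [{t: None for t in tags}]

    # Induction over positions 2..n
    for i in range(1, n):
        delta.append({})
        back.append({})
        for tj in tags:
            num = p_t_given_w.get((tj, words[i]), floor)      # P(t_j | w_{i+1})
            den = p_t_given_w.get((tj, words[i - 1]), floor)  # P(t_j | w_i)
            best_tag, best_score = None, -1.0
            for tk in tags:
                score = delta[i - 1][tk] * num * p_t_given_t.get((tj, tk), floor) / den
                if score > best_score:
                    best_tag, best_score = tk, score
            delta[i][tj] = best_score
            back[i][tj] = best_tag

    # Termination and path read-out
    path = [max(tags, key=lambda t: delta[n - 1][t])]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

With real parameter tables, the loop over tag pairs is the only O(T^2) step per position, so the cost matches that of standard HMM Viterbi decoding.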
III. The POS Tagging Model Employing
Word Clustering
1. Clustering algorithm based on word similarity
Word clustering[15−19] is an important piece of fundamental work in automatic language processing. A large amount of previous research has focused on how best to cluster similar words together. There are many different clustering algorithms, but they can be classified into a few basic types. Clustering algorithms produce two types of structures: hierarchical clusterings and flat (non-hierarchical) clusterings. A flat clustering simply consists of a certain number of clusters, and the relation between clusters is often undetermined. Most algorithms produce flat clusters and improve them by iterating a reallocation operation that reassigns objects. The tree of a hierarchical clustering can be produced either bottom-up, by starting with the individual objects and grouping the most similar ones, or top-down, whereby one starts with all the objects and divides them into groups so as to maximize within-group similarity.
Conventional statistical clustering methods are usually based on a greedy principle. The common metric for evaluating a clustering algorithm is the likelihood function or the perplexity of the corpus. Conventional clustering algorithms often converge to a local optimum, so a global optimum is not guaranteed, and initial choices can influence the final result. In order to solve the above problems, a definition of word similarity utilizing mutual information was presented. Based on word similarity, a definition of word-set similarity was given. Experiments[20] show that the word clustering algorithm based on similarity is better than the conventional greedy clustering method in both speed and performance.
Unlike the methods above, we cluster words on the basis of the similarity[20] between words. First, we must find an appropriate word similarity metric. The common corpus-based approach to computing word similarity represents a word (or term) by the set of its word co-occurrence statistics. It relies on the assumption that the meaning of words is related to their patterns of co-occurrence with other words in the text. This assumption was proposed in early linguistic work, as expressed in Harris' distributional hypothesis: "... the meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities". The famous statement "You shall know a word by the company it keeps!" is another expression of this assumption.
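To make this co-occurrence view concrete, here is a minimal sketch that collects the counts behind the probabilities p(w) and p(w_i, w_j) used below; restricting co-occurrence to adjacent word pairs is an assumption of this sketch, not a detail stated in the paper.

```python
from collections import Counter

def cooccurrence_counts(sentences):
    """Collect unigram and adjacent-pair counts from tokenized sentences."""
    unigrams, pairs = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        pairs.update(zip(sent, sent[1:]))   # ordered (left word, right word) pairs
    return unigrams, pairs

def to_probabilities(unigrams, pairs):
    """Convert raw counts to relative frequencies p(w) and p(w_i, w_j)."""
    n_words = sum(unigrams.values())
    n_pairs = sum(pairs.values())
    word_p = {w: c / n_words for w, c in unigrams.items()}
    pair_p = {pair: c / n_pairs for pair, c in pairs.items()}
    return word_p, pair_p
```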
If the two words w_1 and w_2 are similar, then we may infer that they have similar mutual information with other words. We now define the similarity between the words w_1 and w_2 as follows:

  sim(w_1, w_2) = \frac{\sum_{w} P(w) [\min(I(w, w_1), I(w, w_2)) + \min(I(w_1, w), I(w_2, w))]}{\sum_{w} P(w) [\max(I(w, w_1), I(w, w_2)) + \max(I(w_1, w), I(w_2, w))]}   (9)
where I(w_i, w_j) represents the pointwise mutual information between the two words w_i and w_j:

  I(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i) p(w_j)}   (10)

where p(w_i) and p(w_j) are the probabilities of the events w_i and w_j (occurrences of words, in our case) and p(w_i, w_j) is the probability of the joint event (a co-occurrence pair).
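Building on the hypothetical counting helper sketched earlier, the following sketch transcribes Eq.(10) and Eq.(9) directly, under the assumption that unseen word pairs contribute zero mutual information.

```python
import math

def pmi(pair_p, word_p, w1, w2):
    """I(w1, w2) = log p(w1, w2) / (p(w1) p(w2)); zero if the pair is unseen."""
    joint = pair_p.get((w1, w2), 0.0)
    if joint == 0.0:
        return 0.0
    return math.log(joint / (word_p[w1] * word_p[w2]))

def similarity(word_p, pair_p, w1, w2):
    """sim(w1, w2) of Eq.(9): min/max-weighted agreement of mutual information."""
    num = den = 0.0
    for w, p in word_p.items():
        right = (pmi(pair_p, word_p, w, w1), pmi(pair_p, word_p, w, w2))  # I(w, w1), I(w, w2)
        left = (pmi(pair_p, word_p, w1, w), pmi(pair_p, word_p, w2, w))   # I(w1, w), I(w2, w)
        num += p * (min(right) + min(left))
        den += p * (max(right) + max(left))
    return num / den if den else 0.0
```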
Furthermore, we define the left similarity and the right similarity between the words w_1 and w_2 as:

  sim_L(w_1, w_2) = \frac{\sum_{w} P(w) \min(I(w_1, w), I(w_2, w))}{\sum_{w} P(w) \max(I(w_1, w), I(w_2, w))}   (11)

  sim_R(w_1, w_2) = \frac{\sum_{w} P(w) \min(I(w, w_1), I(w, w_2))}{\sum_{w} P(w) \max(I(w, w_1), I(w, w_2))}   (12)
Based on word similarity, the similarity between clusters C_1 and C_2 may be defined as:

  sim(C_1, C_2) = \frac{\sum_{w_i \in C_1, w_j \in C_2} C(w_i) C(w_j) sim(w_i, w_j)}{\sum_{w_i \in C_1} C(w_i) \sum_{w_j \in C_2} C(w_j)}   (13)

where C(w_i) and C(w_j) represent the number of times the words w_i and w_j occur in the corpus. The left similarity and the right similarity between clusters may be defined similarly.
The clustering algorithm is as follows (a sketch in Python appears after this list):
(a) Compute the similarity between words.
(b) Begin with N clusters (N is the number of words in the lexicon), one for each word.
(c) Select the two clusters with the largest similarity, and create a new cluster by merging them.
(d) Compute the similarity between the new cluster and the other clusters.
(e) Check whether the termination condition is met (the largest similarity between clusters is less than a predetermined threshold, or the desired number of clusters is reached); if yes, terminate; otherwise go to (c).
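A minimal sketch of steps (a)-(e), with the cluster similarity of Eq.(13); the threshold and the target number of clusters are illustrative parameters, and word_sim stands for any pairwise word similarity such as Eq.(9).

```python
def cluster_similarity(c1, c2, count, word_sim):
    """sim(C1, C2) of Eq.(13): count-weighted average word similarity."""
    num = sum(count[wi] * count[wj] * word_sim(wi, wj) for wi in c1 for wj in c2)
    den = sum(count[wi] for wi in c1) * sum(count[wj] for wj in c2)
    return num / den if den else 0.0

def agglomerative_clustering(words, count, word_sim, threshold=0.1, k=100):
    """Bottom-up clustering: start with one cluster per word, merge greedily."""
    clusters = [[w] for w in words]                      # step (b)
    while len(clusters) > k:
        best, pair = -1.0, None
        for i in range(len(clusters)):                   # step (c): most similar pair
            for j in range(i + 1, len(clusters)):
                s = cluster_similarity(clusters[i], clusters[j], count, word_sim)
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:                             # step (e): stop condition
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]          # merge the two clusters
        del clusters[j]
    return clusters
```

Recomputing all pairwise cluster similarities at every merge keeps the sketch close to steps (c) and (d) but is quadratic per iteration; a practical implementation would cache the similarities.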
2. The POS tagging model employing word clustering
To alleviate the problem of sparse data, the parameters of Eq.(8) can be estimated by a smoothing method such as:

  P(t_i|w_i) = \lambda_{C_i} P(t_i|w_i) + (1 - \lambda_{C_i}) P(t_i|C_i)   (14)

where C_i represents the cluster that w_i belongs to, and \lambda_{C_i}, 0 < \lambda_{C_i} < 1, is a smoothing parameter; the left-hand side denotes the smoothed estimate.
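A minimal sketch of the back-off in Eq.(14), assuming a word-to-cluster map and a single, illustrative interpolation weight are available.

```python
def smoothed_tag_given_word(tag, word, p_t_given_w, p_t_given_c, cluster_of, lam=0.7):
    """Eq.(14): interpolate P(t|w) with the cluster-level estimate P(t|C(w))."""
    c = cluster_of[word]
    return lam * p_t_given_w.get((tag, word), 0.0) + (1.0 - lam) * p_t_given_c.get((tag, c), 0.0)
```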
IV. The POS Tagging Model Combined with Syntactic Parsing

Traditionally, in Chinese language processing, word segmentation, POS tagging and syntactic parsing are carried out sequentially. That is, the input Chinese sentence is first segmented into words, then the segmented result is passed to the POS tagging component, and syntactic parsing proceeds on the basis of the POS tagging. It can be seen that part-of-speech is an indispensable feature in syntactic parsing. However, this processing strategy has the following disadvantages: (1) the word lexicons used in POS tagging and syntactic parsing may differ; this difference is difficult to overcome and significantly lowers system accuracy, even if different optimal algorithms are applied to POS tagging and syntactic parsing; (2) with speed in mind, the two-stage processing strategy is not efficient. Therefore, we apply the strategy of integrating POS tagging and syntactic parsing in a single stage.
Head-driven statistical models for natural language parsing[21] are the most representative lexicalized syntactic parsing models. To take advantage of lexicalized information, each PCFG rule in head-driven statistical models can be lexicalized by associating a word w and a part-of-speech (POS) tag t with each nonterminal X in the tree. Owing to the introduction of lexicalized information, a serious data sparseness problem is unavoidable. To relieve this problem, the right-hand side of each rule in head-driven statistical models is broken up into three parts: the head child, and the left and right modifiers of the head child. The first thing to note is that each internal rule in a lexicalized PCFG has the form

  P(ht, hw) \rightarrow L_m(lt_m, lw_m) \cdots L_1(lt_1, lw_1) H(ht, hw) R_1(rt_1, rw_1) \cdots R_n(rt_n, rw_n)   (15)

where H(ht, hw) is the head child of the rule, which inherits the head word/tag pair h from its parent P. L_m(lt_m, lw_m) \cdots L_1(lt_1, lw_1) and R_1(rt_1, rw_1) \cdots R_n(rt_n, rw_n) are the left and right modifiers of H(ht, hw). Either n or m may be zero, and n = m = 0 for unary rules. We extend the left and right sequences to include a terminating STOP symbol, allowing a Markov process to model them; thus L_{m+1} = R_{n+1} = STOP. The probability of an internal rule Eq.(15) can be rewritten (exactly) using the chain rule of probabilities:

  P_h(H|P(ht, hw)) \cdot \prod_{i=1}^{m+1} P_i(L_i(lt_i, lw_i)|H, P, h, \Delta_l(i-1)) \cdot \prod_{i=1}^{n+1} P_i(R_i(rt_i, rw_i)|H, P, h, \Delta_r(i-1))   (16)

Here \Delta_l(i-1) and \Delta_r(i-1) are functions of the surface string below the previous modifiers.

To integrate part-of-speech tagging with syntactic parsing, the probability of an internal rule Eq.(15) is modified as follows:

  P_h(H|P(ht, hw)) \cdot \prod_{i=1}^{m+1} P_i(L_i(lt_i, lw_i)|L_{i-1}(lt_{i-1}, lw_{i-1}), \cdots, L_1(lt_1, lw_1), H, P, h) \cdot \prod_{i=1}^{n+1} P_i(R_i(rt_i, rw_i)|R_{i-1}(rt_{i-1}, rw_{i-1}), \cdots, R_1(rt_1, rw_1), H, P, h)   (17)

The probability P_i(R_i(rt_i, rw_i)|R_{i-1}(rt_{i-1}, rw_{i-1}), \cdots, R_1(rt_1, rw_1), H, P, h) in Eq.(17) may be decomposed into the product of two probabilities:

  P_i(rt_i|rt_{i-1}, rt_{i-2}, \cdots, rt_1, rw_i)   (18)

  P_i(rw_i|H, P, h, \Delta_r(i-1))   (19)

Eq.(18) is the part-of-speech tagging probability within syntactic parsing. Supposing further that rw_i and rt_{i-1}, rt_{i-2}, \cdots, rt_1 are conditionally independent given rt_i, we have:

  P_i(rt_i|rt_{i-1}, rt_{i-2}, \cdots, rt_1, rw_i) = \frac{P_i(rt_i|rt_{i-1}, rt_{i-2}, \cdots, rt_1) \cdot P_i(rt_i|rw_i)}{P_i(rt_i)}   (20)

The ratio appearing in formula (20),

  \frac{P_i(rt_i|rt_{i-1}, rt_{i-2}, \cdots, rt_1)}{P_i(rt_i)} = \frac{P_i(rt_i, rt_{i-1}, rt_{i-2}, \cdots, rt_1)}{P_i(rt_i) \cdot P_i(rt_{i-1}, rt_{i-2}, \cdots, rt_1)}   (21)

is the pointwise mutual information between rt_i and rt_{i-1}, rt_{i-2}, \cdots, rt_1, so the probabilistic meaning of Eq.(20) is very clear and also agrees with the language phenomenon. The probability P_i(rt_i|rt_{i-1}, rt_{i-2}, \cdots, rt_1) in Eq.(20) can be estimated by introducing the part-of-speech tagging model based on the part-of-speech collocation relationship between adjacent words.
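A minimal sketch of the factorization in Eq.(20); the dictionary-based estimates and the fixed probability floor are placeholders for whatever parameter estimation the parser actually uses, and all names are illustrative.

```python
def p_tag_given_history_and_word(tag, prev_tags, word,
                                 p_tag_given_hist, p_tag_given_word, p_tag):
    """Eq.(20): P(rt_i | rt_{i-1..1}, rw_i) = P(rt_i | history) * P(rt_i | rw_i) / P(rt_i)."""
    floor = 1e-9
    hist = p_tag_given_hist.get((tag, tuple(prev_tags)), p_tag.get(tag, floor))
    lex = p_tag_given_word.get((tag, word), p_tag.get(tag, floor))
    return hist * lex / max(p_tag.get(tag, floor), floor)
```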
V. Experimental Results

1. Word clustering experiments
(1) Corpus
We use annotated corpora selected from the 1998 People's Daily newspaper for training and testing. Firstly, to compare the greedy clustering method based on minimum entropy with the hierarchical clustering method based on word similarity[20], we take a 2M corpus for testing and several corpora for comparing the run time of the two clustering methods.
(2) Word clustering experimental results
The experimental results are shown in Table 1.

Table 1. Clustering experiment results of two algorithms
                   Greedy algorithm   Algorithm based on similarity
Perplexity         283                218
Accuracy rate      91.54%             95.38%

From Table 1, it can be seen that the perplexity is reduced from 283 to 218, and the word clustering algorithm based on similarity performs better than the conventional greedy clustering method.

As shown in Fig.2, the run time of the greedy clustering method increases sharply as the size of the corpus increases, and the efficiency of the hierarchical clustering method based on word similarity is clearly higher than that of the greedy clustering method. As the test size becomes larger, the advantage of the hierarchical clustering method based on word similarity becomes more obvious.

Fig. 2. Run time comparison of two clustering algorithms

2. POS tagging experiments
We used an annotated corpus selected from the 1998 People's Daily newspaper for training and testing. The corpus uses 42 tags and has about 244974 tokens. Some properties of the annotated corpus are shown in Table 2.
Table 2. Some properties of the annotated corpus
Tags     Types     Percent (%)    Tokens     Percent (%)
1        20048     89.720         162246     66.230
2        1934      8.655          50243      20.510
3        297       1.329          21419      8.743
4        51        0.228          9901       4.042
5        10        0.045          424        0.173
6        4         0.018          155        0.173
7        1         0.004          586        0.239
Total    22345                    244974

The baseline system is the part-of-speech tagging model based on the Hidden Markov model; Model 1 is the POS tagging model based on the improved HMM; Model 2 is the POS tagging model integrating syntactic parsing, based on the improved HMM; Model 3 is the POS tagging model integrating syntactic parsing and word clustering, based on the improved HMM. The experimental results are shown in Table 3.

Table 3. Tagging experimental results
Model       Accuracy
Baseline    94.642%
Model 1     96.214%
Model 2     97.126%
Model 3     97.235%

From Table 3, it can be seen that the tagging method based on the improved HMM has clearly higher performance than the conventional POS tagging method based on the Hidden Markov model under the same testing conditions; the precision is enhanced from 94.642% to 96.214%. The reason for this improvement may be that the conditional-independence assumption in the improved HMM is more realistic than the independence assumption in the HMM. In addition, Model 3 has the highest accuracy among the four part-of-speech tagging models.

Next, we analyzed the relation between the training corpus and the tagging accuracy. As shown in Fig.3, the relation between the scale of the training corpus and the improvement in tagging accuracy is not linear. In general, the larger the training corpus, the closer the estimated probability parameters are to the true language phenomenon, and the higher the tagging accuracy. But when the scale of the training corpus reaches a certain level, the gain in tagging accuracy becomes smaller, and system performance improves more and more slowly. As the training size becomes smaller, the impact of word clustering becomes more obvious, preventing a noticeable decrease in tagging accuracy.

Fig. 3. The relation between training set size and tagging errors

VI. Conclusions
(1) The advent of the Hidden Markov model (HMM) has brought about considerable progress in natural language processing and speech recognition technology. However, a number of unrealistic assumptions in HMMs are still regarded as obstacles to their potential effectiveness. A major one is the inherent assumption that successive observations are independent and identically distributed (IID) within a state. To overcome the defects of the classical HMM, the Markov family model, a new statistical model, is introduced in this paper; it overcomes the defects of the unrealistic HMM assumptions. Tagging experimental results have verified the efficacy of the proposed model: the precision is enhanced from 94.642% to 96.214%.
(2) Traditionally, in Chinese language processing, word segmentation, POS tagging and syntactic parsing are carried out sequentially. In this paper, we apply the strategy of integrating POS tagging and syntactic parsing in a single stage. By decomposing and modifying the rules of head-driven statistical models for natural language parsing, the POS tagging model integrating syntactic parsing, based on the improved HMM, further enhances the precision from 96.214% to 97.126%.
(3) Cluster-based statistical language modeling is an important method for solving the problem of sparse data. At least for the Chinese POS tagging task, word clustering can reduce the influence of the shortage of labeled corpora.
References
[1] Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, London, pp.136–157, 1999.
[2] K. Toutanova, D. Klein, C.D. Manning, Y. Singer, "Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network", Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, Canada, pp.252–259, 2003.
[3] W. Jiang, Y. Guan, X.L. Wang, "Conditional Random Fields Based POS Tagging", Computer Engineering and Applications, Vol.42, No.21, pp.13–16, 2006.
[4] Jiang Tao, Yao Tianshun, Zhang Li, "Application Study of Example Based Chinese Word Segmentation and Part-of-speech Tagging Method", Journal of Chinese Computer Systems, Vol.28, No.11, pp.2090–2093, 2007. (in Chinese)
[5] Eugene Charniak, Curtis Hendrickson, Neil Jacobson and Mike Perkowitz, "Equations for Part-of-Speech Tagging", Proceedings of the Eleventh National Conference on Artificial Intelligence, Menlo Park, AAAI Press/MIT Press, pp.784–789, 1993.
[6] T. Brants, "A Statistical Part-of-Speech Tagger", Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLP-2000), Seattle, pp.224–231, 2000.
[7] Wei Ou, Wu Jian, Sun Yufang, "Analysis and Improvement of Statistics-Based Chinese Part-of-Speech Tagging", Journal of Software, Vol.11, No.4, pp.473–480, 2000. (in Chinese)
[8] Liang Yimin, Huang De-gen, "Chinese Part-of-speech Tagging Based on Full Second-order Hidden Markov Model", Computer Engineering, Vol.31, No.10, pp.177–179, 2005. (in Chinese)
[9] Qu Gang, Lu Ru-zhan, "An Improved Part-of-Speech (POS) Tagging System", Journal of Shanghai Jiaotong University, Vol.37, No.6, pp.897–900, 2003. (in Chinese)
[10] J. Gimenez, L. Marquez, "Fast and Accurate Part-of-Speech Tagging: The SVM Approach Revisited", Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2003), Bulgaria, pp.158–165, 2003.
[11] Zhao Yan, Wang Xiao-long, Liu Bing-quan, Guan Yi, "Fusion of Clustering Trigger-Pair Features for POS Tagging Based on Maximum Entropy Model", Journal of Computer Research and Development, Vol.43, No.2, pp.268–274, 2006. (in Chinese)
[12] Xing Fu-kun, Song Rou, Luo Zhi-yong, "Symbol-and-Statistics Decoding Model and Its Application in Chinese POS Tagging", Journal of Chinese Information Processing, Vol.24, No.1, pp.20–24, 2010. (in Chinese)
[13] Zhu Cong-hui, Zhao Tie-jun, Zheng De-quan, "Joint Chinese Word Segmentation and POS Tagging System with Undirected Graphical Models", Journal of Electronics & Information Technology, Vol.32, No.3, pp.700–704, 2010. (in Chinese)
[14] Yuan Li-chi, "A Speech Recognition Method Based on Improved Hidden Markov Model", Journal of Central South University: Natural Science, Vol.39, No.6, pp.1303–1308, 2008. (in Chinese)
[15] Takuya Matsuzaki, Yusuke Miyao, Jun'ichi Tsujii, "An Efficient Clustering Algorithm for Class-Based Language Models", Proceedings of the 7th Conference on Computational Natural Language Learning (CoNLL-2003), Edmonton, Canada, pp.119–126, 2003.
[16] Ido Dagan, "Contextual Word Similarity and Estimation from Sparse Data", Computer Speech and Language, Vol.9, No.2, pp.123–152, 1995.
[17] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, John W. Tukey, "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections", Proceedings of the 15th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'92), Copenhagen, Denmark, pp.318–329, 1992.
[18] Lillian Lee, Similarity-Based Approaches to Natural Language Processing, Harvard University, Cambridge, MA, pp.56–72, 1997.
[19] Yael Karov, Shimon Edelman, "Learning Similarity-Based Word Sense Disambiguation from Sparse Data", Proceedings of the Fourth Workshop on Very Large Corpora, Copenhagen, Denmark, pp.42–55, 1996.
[20] Yuan Li-chi, "Word Clustering Based on Similarity and Vari-Gram Language Model", Journal of Chinese Computer Systems, Vol.30, No.5, pp.912–915, 2009. (in Chinese)
[21] M. Collins, "Head-Driven Statistical Models for Natural Language Parsing", Computational Linguistics, Vol.29, No.4, pp.589–637, 2003.
YUAN Lichi was born in 1973, Ph.D., associate professor. His research mainly focuses on natural language processing. (Email: [email protected])