Tibetan Word Segmentation and POS Tagging Research Based on Knowledge Feedback
Wei Bao1,*, Luobsang Karten2
1National Language Resource Monitoring & Research Center of Minority Languages, Minzu University of China, Beijing, China
2Tibetan Information Technology Research Center, Tibet University, Lhasa, Tibet, China
Abstract. Tibetan word segmentation and POS (part-of-speech) tagging are primary tasks in Tibetan information processing. In this paper, a knowledge feedback method is proposed for Tibetan word segmentation and POS tagging. Firstly, we constructed a Tibetan word segmentation model with 165,100 sentences and a Tibetan POS tagging model with 356,500 sentences. Then we built a Tibetan knowledge base based on Tibetan linguistic features, which was used to give feedback on the results of the word segmentation model and the POS tagging model. In an open set test with 16,600 sentences for the Tibetan word segmentation model and 206,100 sentences for the Tibetan POS tagging model, our system achieved good results. The precision of Tibetan word segmentation and POS tagging is 96.03% and 98.75% respectively, which basically meets practical requirements.
Keywords: Word segmentation, POS tagging, knowledge feedback, Tibetan, CRFs
1. Introduction
Word segmentation and POS tagging are basic problems in Tibetan natural language processing and the foundation of information retrieval and semantic understanding. Unlike English and many other Western languages, Tibetan has no delimiter to mark word boundaries: sentences are presented as strings of Tibetan syllables. Moreover, multi-category words are common in Tibetan.
Yangmo Droma collected a manually tagged corpus of more than 120,000 words, in which occurrences of multi-category words numbered more than 60,481, accounting for 49.17% [1]. For example, one Tibetan written form can be read either as a noun meaning "a woman of Kongpo" or as an adjective meaning "collapsed"; the two readings share the same pronunciation and shape, and only context distinguishes them. Thus the lack of delimiters and the prevalence of multi-category words are two challenges for Tibetan natural language processing.
At present there are three approaches to Tibetan word segmentation: methods based on string matching, methods based on rules, and methods based on statistics. Maximum matching and minimum matching are the typical string-matching methods. Jiang [2] proposed a maximum matching method for Tibetan. Luo and Jiang [3] summarized 36 rules from a 5-million-word corpus and built a basic framework for Tibetan word segmentation. String-matching methods are simple and efficient, but they are limited by their lexicons. Rule-based methods make full use of the linguistic features of Tibetan. Chen [4] implemented a Tibetan word segmentation system based on case-auxiliary words and continuous features. Qi [5] proposed a three-level Tibetan segmentation system that combines segmentation, case frames, merging, annotation and marking into an organic whole, based on studies of Tibetan formal logical case, semantic logical case and phonological tendency. Many other rule-based Tibetan word segmentation systems have been proposed, for example by Cai Zhijie, Sun and Norbu [6-8]. Statistical methods exploit statistical information between words and characters, such as word frequency and co-occurrence information; N-gram, HMM and CRF models are used to train language models. These methods perform well in practice, especially on large-scale corpora, but Tibetan linguistic phenomena are varied and complex, so a large-scale corpus alone is not enough: the scale of the corpus and the dictionary is crucially important for statistical methods.
Compared with research on Western languages and Chinese POS tagging, research on Tibetan POS tagging lags behind. Most studies focus on statistical models, including HMM, maximum entropy and CRFs. Su [9] studied Tibetan POS tagging based on HMM, building a dictionary from the POS and probabilistic information of words. Shi [10] used the Chinese Segtag system for Tibetan POS tagging and achieved a precision of 83.17%. Long [11] proposed training data with multi-level annotation to enhance POS tagging, and experimental results showed that syllable tags could correct certain POS tagging errors.
Most previous work uses the output of the POS tagging model directly, paying little attention to Tibetan linguistic features; the corpora used in experiments are not large enough to train a language model; and the Tibetan POS tag sets differ across studies. In this paper, we constructed a Tibetan word segmentation model with 165,100 sentences and a Tibetan POS tagging model with 356,500 sentences, and carried out experiments in which the training set shares no data with the test set. We then revised the errors in the output of the word segmentation model to build a Tibetan word segmentation knowledge base, which was finally used to give feedback to the CRF word segmentation model and the POS tagging model.
The rest of this paper is organized as follows. Section 2 describes the knowledge base for the Tibetan word segmentation system. Section 3 presents the knowledge feedback method and the experiments. Section 4 concludes.
2. Tibetan word segmentation
2.1 Tibetan word segmentation based on CRFs
There are still several difficulties for CRF-based Tibetan word segmentation. Firstly, there is no delimiter to mark word boundaries in Tibetan. Secondly, there are many monosyllables in Tibetan, called gelling words, which increase the difficulty of segmentation. Thirdly, segmentation ambiguity is common in Tibetan. Finally, unknown word recognition remains a challenge.
The CRF model is a discriminative undirected probabilistic graphical model proposed by Lafferty, which is widely applied in natural language processing. In this paper, we implemented a Tibetan word segmentation system based on the CRF model.
2.2 Feature Template
The "BIES" tagging scheme is widely used in Tibetan word segmentation: "B" denotes the start syllable of a Tibetan word, "I" an internal syllable, "E" the end syllable, and "S" a single-syllable Tibetan word. In this paper, we add two new tags, "Eg" and "Sg", which denote an end syllable in gelling form and a single-syllable word in gelling form, respectively.
The feature function contains two parts, atomic features and compound features. An atomic feature contains one observation unit; in this paper the current syllable, the two preceding syllables and the two following syllables constitute the atomic features, with the center syllable as the basic unit. The feature template used in the experiments is listed in Table 1.
Table 1 Feature Template for Word Segmentation
Feature   Description
W-2       second syllable before the center syllable
W-1       first syllable before the center syllable
W0        the center syllable
W1        first syllable after the center syllable
W2        second syllable after the center syllable
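To make the tagging scheme and the window features of Table 1 concrete, the following is a minimal Python sketch (not the authors' code) that converts a segmented sentence into syllable-level tags and extracts the W-2..W2 atomic features; the input format (each word given as a list of syllables plus a gelling flag) and the padding symbol are illustrative assumptions.

```python
# Minimal sketch: BIES tagging with the extra "Eg"/"Sg" tags for gelling
# forms, plus the W-2..W2 window features of Table 1.

def bies_tags(words):
    """words: list of (syllables, is_gelling) pairs -> (syllables, tags)."""
    syls, tags = [], []
    for syllables, is_gelling in words:
        if len(syllables) == 1:
            tags.append("Sg" if is_gelling else "S")
        else:
            tags.append("B")
            tags.extend("I" * (len(syllables) - 2))
            tags.append("Eg" if is_gelling else "E")
        syls.extend(syllables)
    return syls, tags

def window_features(syls, i):
    """Atomic features W-2..W2 around the center syllable syls[i]."""
    def at(j):
        return syls[j] if 0 <= j < len(syls) else "_PAD_"
    return {"W-2": at(i - 2), "W-1": at(i - 1), "W0": at(i),
            "W1": at(i + 1), "W2": at(i + 2)}

# Example with placeholder syllables (real input would be Tibetan syllables).
words = [(["syl1", "syl2"], False), (["syl3"], True)]
syls, tags = bies_tags(words)
print(list(zip(syls, tags)))      # [("syl1","B"), ("syl2","E"), ("syl3","Sg")]
print(window_features(syls, 1))
```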
2.3 Tibetan word segmentation based on knowledge feedback
Tibetan word segmentation based on CRFs achieves good performance, but some errors caused by the varied linguistic features of Tibetan still exist. We propose a feedback method to revise the errors of the CRF word segmentation model, focusing mainly on non-Tibetan character segmentation errors, Tibetan gelling word recognition errors, high-frequency word segmentation errors and named entity recognition errors.
(a) Non-Tibetan character segmentation errors
For example, in a sentence meaning "Warmly celebrate the 65th anniversary of the founding of new China", the CRF model output a token in which non-Tibetan characters and an adjacent Tibetan syllable were treated as a single unit; the correct result segments the non-Tibetan characters from the Tibetan syllables.
This error has two causes. First, there are some non-Tibetan characters in the corpus, while the CRF model is trained on Tibetan syllable sequences; when raw data is analyzed by the CRF model, a Tibetan character and a non-Tibetan character may be treated as one syllable that never appeared in the training set. Second, the training set contains some syllables consisting of both Tibetan and non-Tibetan characters. We therefore set a rule to avoid this error.
Rule 1:
If wi∈U (i≠0), cut wi from set S.
"S" denotes the Tibetan sentence to be segmented, and "U" denotes the non-Tibetan character set, U = {D, E, C, P}: "D" is the set of time expressions and figures such as "123", "3.14" and "30%"; "E" and "C" are the sets of English characters and Chinese characters; and "P" is the set of punctuation. Rule 1 avoids the error of treating a Tibetan character and a non-Tibetan character as one syllable.
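A minimal Python sketch of how Rule 1 could be applied as a pre- or post-processing step is given below (not the authors' implementation); the Unicode range test for Tibetan and the mapping of characters to the classes D, E, C, P are illustrative assumptions.

```python
import unicodedata

def is_tibetan(ch):
    # Tibetan Unicode block is U+0F00..U+0FFF (illustrative test).
    return "\u0F00" <= ch <= "\u0FFF"

def char_class(ch):
    """Map a character to one of the non-Tibetan classes in U = {D, E, C, P}."""
    if ch.isdigit() or ch in ".%":
        return "D"                      # digits, decimal point, percent sign
    if "a" <= ch.lower() <= "z":
        return "E"                      # English letters
    if "\u4e00" <= ch <= "\u9fff":
        return "C"                      # Chinese characters
    if unicodedata.category(ch).startswith("P"):
        return "P"                      # punctuation
    return None

def apply_rule1(sentence):
    """Cut maximal runs of non-Tibetan characters out of the sentence S."""
    tokens, buf, buf_tib = [], "", None
    for ch in sentence:
        tib = is_tibetan(ch) or char_class(ch) is None
        if buf and tib != buf_tib:
            tokens.append(buf)
            buf = ""
        buf, buf_tib = buf + ch, tib
    if buf:
        tokens.append(buf)
    return tokens

print(apply_rule1("གི་65%གི་"))   # -> ["གི་", "65%", "གི་"]
```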
(b) Tibetan gelling word recognition errors
A high repetition rate is characteristic of gelling words, so we used word frequency to resolve gelling word recognition errors. Firstly, we counted gelling words in a large-scale corpus and found 101,265 occurrences in total; after duplicates were removed, 305 syllables were left, which reflects the high repetition rate of gelling words. We then calculated, for each such syllable, the rate fc at which it occurs as a gelling syllable:
fc = frequency of the word occurring as a gelling syllable / frequency of the word in the corpus
Table 2 The first nine gelling words with high frequency
Syllable   Frequency of the word in corpus   Frequency as gelling syllable   fc
པའི་        25808                             25301                           0.98
—          10916                             6471                            0.59
བའི་        9032                              8978                            0.99
བར་         4426                              2526                            0.75
པས་         3702                              3473                            0.93
བས་         3560                              2763                            0.77
རྒྱུར་        3349                              3346                            0.99
པོར་         2596                              2579                            0.99
ཚོར་         2329                              1736                            0.74
Table 2 lists the nine gelling words with the highest frequency. The fc of the first gelling word "པའི་" is 0.98, which shows that this word occurs as a gelling word in most cases. We therefore used a threshold on fc to judge whether a word is a gelling word, and set Rule 2.
Rule 2:
If wi∈N (i≠0) and fc>f, wi is judged to be a gelling word.
"N" denotes the set of gelling word candidates and "nj" denotes an element of N. "f" is the threshold value used in the experiments; we tuned f through repeated experiments and found that the results were best when f was set to 55% (0.55).
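The fc statistic and Rule 2 could be realized roughly as in the Python sketch below (not the authors' code); the counting procedure and the threshold value f = 0.55 are assumptions based on the description above, and the toy counts are taken from Table 2.

```python
from collections import Counter

def gelling_scores(total_counts, gelling_counts):
    """fc(w) = freq(w as gelling syllable) / freq(w) for each candidate w."""
    return {w: gelling_counts[w] / total_counts[w]
            for w in gelling_counts if total_counts[w] > 0}

def apply_rule2(tokens, gelling_set, fc, f=0.55):
    """Mark a token as a gelling word if it is in N and its fc exceeds f."""
    return [(w, w in gelling_set and fc.get(w, 0.0) > f) for w in tokens]

# Toy counts echoing Table 2 (Tibetan syllables taken from the table).
total = Counter({"བའི་": 9032, "བར་": 4426})
gelling = Counter({"བའི་": 8978, "བར་": 2526})
fc = gelling_scores(total, gelling)
print(apply_rule2(["བའི་", "བར་"], set(gelling), fc))
```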
(c) High-frequency word segmentation errors
The high-frequency words considered in this paper include the following:
Tibetan case-auxiliary word: “གི་”, “ཀི་”, “གིས་”, “ཀིས་”;
Tibetan punctuation: “༄༅།།”, “།།”, “།”;
Tibetan figures: “གསུམ་”, “༡༣”, “ཞེ་གཅིག་”, “ས་ཡ་”;
Tibetan temporal words: “ས་ག་ཟླ་བ་”, “རབ་བྱུང་བརྒྱད་པ་”, “ལྕགས་རྟ་ལོ་”;
Case-auxiliary words usually cannot be segmented correctly from the neighbouring words by the CRF model alone, so we built a Tibetan high-frequency word table and set Rule 3 to avoid this kind of error. "SW" denotes the set of words in the high-frequency word table.
Rule 3:
If wi∈SW (i≠0), cut wi from set S.
(d) Named entity recognition errors
CRFs segmentation result:
ཕོ་བྲང་/པོ་ཏ་ལ/འི་/རྒྱབ་/ཀི་/རོང་/རྒྱབ་ཀླུ་ཁང་/ནང་/མེ་ཏོག་/ཁྲ་ཆིལ་ཆིལ་/དུ་/བཞད་/།
(Meaning: flowers bloom in a riot of colour in the Lukhang behind the Potala Palace)
Correct result: ཕོ་བྲང་/པོ་ཏ་ལ/འི་/རྒྱབ་/ཀི་/རོང་རྒྱབ་ཀླུ་ཁང་/ནང་/མེ་ཏོག་/ཁྲ་ཆིལ་ཆིལ་/དུ་/བཞད་/།
Error result: “རོང་/རྒྱབ་ཀླུ་ཁང་/” should be segmented as “རོང་རྒྱབ་ཀླུ་ཁང་/”
This error was caused by unknown words that were outside the training set. We therefore built a Tibetan named entity corpus, including person names, place names and organization names, and set Rule 4 as follows.
Rule 4:
If wi∈T (i≠0), cut wi from set S.
"T" denotes the named entity corpus and "tj" is an element of T.
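Rules 3 and 4 both force lexicon entries (high-frequency words in SW and named entities in T) to be kept as whole units. One plausible way to apply them to the CRF output, sketched below in Python (not the authors' algorithm), is to merge adjacent output tokens whose concatenation matches a lexicon entry; the merging strategy and the window size are assumptions.

```python
def apply_lexicon_feedback(tokens, lexicon, max_span=4):
    """Merge adjacent CRF output tokens whose concatenation is a lexicon
    entry (from SW or T), so that the entry is kept as one word."""
    out, i = [], 0
    while i < len(tokens):
        merged, step = None, 1
        # Prefer the longest match within the window.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in lexicon:
                merged, step = candidate, span
                break
        out.append(merged if merged is not None else tokens[i])
        i += step
    return out

# Named-entity example from the text: "རོང་" + "རྒྱབ་ཀླུ་ཁང་" should stay one word.
entities = {"རོང་རྒྱབ་ཀླུ་ཁང་"}
print(apply_lexicon_feedback(["རོང་", "རྒྱབ་ཀླུ་ཁང་", "ནང་"], entities))
```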
2.4 Experiments and results
The experimental data was mainly collected from Tibetan news sites, People's Daily (Tibetan channel), Qinghai Tibetan Radio and Xinhua Net, and covers news, entertainment, poetry, culture and religion. The sizes of the training and test data are listed in Table 3. On average, a sentence contains about 15 words and corresponds to approximately 4.7 KB; in this paper we use the sentence as the unit of measurement to make comparison between experiments convenient.
In total, we collected 165,100 sentences for Tibetan word segmentation, of which 148,500 sentences were used for the training set and 16,600 for the test set; the two sets share no common data.
Table 3 Performance of available Tibetan word segmentation systems
System              Precision   Training data (sentences)   Test data (sentences)
SegT [10]           96.98%      3,000                       1,000
Sun [7]             95.70%      12,942                      500
Huaquecairang [13]  92.12%      22,000                      573
Kang [14]           91.27%      66,667                      1,333
Long [11]           89.9%       15,900                      3,987
CRFs model          94.71%      148,500                     16,000
Feedback method     96.03%      148,500                     16,000
In the open set test, the precision, recall and F-value of the knowledge feedback method were 96.03%, 96.11% and 96.06%, higher than those of the CRF model. Table 3 lists the performance comparison between our methods and the available Tibetan word segmentation systems; compared with the other systems, our method performs better.
3. POS tagging system
3.1 Tibetan POS Tagging based on CRFs
Multi-category words and unknown words are two difficulties for Tibetan POS tagging; determining the POS of a multi-category word or an unknown word from its context is important. In this paper, we designed a Tibetan POS tagging system based on CRFs and selected well-performing features under large-scale training data.
We used the "Standardized Set of Tibetan POS Markers for Computational Uses" published by the State Language Commission [12] in 2015. Considering the practicability of the Tibetan POS tagging system, we labeled nouns at the second (fine-grained) level and all other words at the first (coarse-grained) level.
3.2 Feature template
To analyze the performance of each feature, we defined a feature template for each feature. "U03:%x[0,0]" is the template of the "w0" feature, which denotes the Tibetan center word; "U02:%x[-1,0]" is the template of the "w-1" feature, which denotes the word immediately preceding the center word; the other templates extend outward from the center word by analogy. The feature templates are listed in Table 4.
To select the optimal feature template for Tibetan POS tagging, we tested different combinations of feature templates in open set and closed set tests. The combinations used in the experiments are shown in Table 5: "CRF template A" contains only "U02", "CRF template B" contains "U01" and "U02", and so on.
Table 4 Feature Template for Tibetan POS Tagging
No.   Feature   CRF template
1     w-3       U00:%x[-3,0]
2     w-2       U01:%x[-2,0]
3     w-1       U02:%x[-1,0]
4     w0        U03:%x[0,0]
5     w1        U04:%x[1,0]
6     w2        U05:%x[2,0]
7     w3        U06:%x[3,0]
8     w-3w-2    U07:%x[-3,0]/%x[-2,0]
9     w-2w-1    U08:%x[-2,0]/%x[-1,0]
10    w-1w0     U09:%x[-1,0]/%x[0,0]
11    w0w1      U10:%x[0,0]/%x[1,0]
12    w1w2      U11:%x[1,0]/%x[2,0]
13    w2w3      U12:%x[2,0]/%x[3,0]
14    w-1w1     U13:%x[-1,0]/%x[1,0]
15    w-2w2     U14:%x[-2,0]/%x[2,0]
16    w-3w3     U15:%x[-3,0]/%x[3,0]
Table 5 CRF Feature Template
No.   CRF template
A     U02
B     U01; U02
C     U01; U02; U03
D     U01; U02; U03; U04
E     U00; U01; U02; U03; U04
F     U00; U01; U02; U03; U04; U05
G     U00; U01; U02; U03; U04; U05; U06
H     U00; U01; U02; U03; U04; U05; U06; U07
I     U00; U01; U02; U03; U04; U05; U06; U07; U08
J     U00; U01; U02; U03; U04; U05; U06; U07; U08; U09
K     U00; U01; U02; U03; U04; U05; U06; U07; U08; U10
L     U00; U01; U02; U03; U04; U05; U06; U07; U08; U10; U11
Using the feature templates in Table 5, we carried out a series of experiments; the results are shown in Figure 1.
Figure 1. Performance of different CRF feature templates.
We can see that as features are added, the precision of the POS tagging system increases. The precision, recall and F-value of feature template K are 95.43%, 94.89% and 95.11%, the best performance obtained; the precision, recall and F-value of feature template L are lower than those of template K. So in our experiments feature template K attains the best performance.
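As an illustration of how the best-performing template K of Table 5 maps onto a standard CRF toolkit configuration, the Python sketch below writes out the corresponding feature macros in CRF++ template syntax. The choice of CRF++ syntax, the output file name, and the trailing bigram line "B" are assumptions, since the paper does not state which CRF toolkit was used.

```python
# Sketch: emit feature template K from Table 5 in CRF++ template syntax.
TEMPLATE_K = {
    "U00": "%x[-3,0]", "U01": "%x[-2,0]", "U02": "%x[-1,0]",
    "U03": "%x[0,0]",  "U04": "%x[1,0]",  "U05": "%x[2,0]",
    "U06": "%x[3,0]",
    "U07": "%x[-3,0]/%x[-2,0]", "U08": "%x[-2,0]/%x[-1,0]",
    "U10": "%x[0,0]/%x[1,0]",
}

def write_template(path="template_K.txt"):
    with open(path, "w", encoding="utf-8") as fh:
        for name, macro in sorted(TEMPLATE_K.items()):
            fh.write(f"{name}:{macro}\n")
        fh.write("B\n")            # default bigram transition feature (assumption)

write_template()
```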
3.3 Tibetan POS tagging based on knowledge feedback
By analyzing the results of CRF POS tagging, we found that the error rates for verbs and adjectives were relatively high, and that tagging errors on numerals and punctuation still existed. Tibetan numerals and punctuation form closed sets, so we collected the Tibetan numerals and punctuation together with the knowledge base built for the word segmentation system, and designed a feedback algorithm that revises the CRF POS tagging results to improve performance.
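The paper does not spell out the feedback algorithm; the Python sketch below shows one plausible form of it, overriding the CRF-predicted tags of tokens found in the closed numeral and punctuation sets. The example set contents come from the word segmentation knowledge base above, while the tag names "m" (numeral) and "x" (punctuation) are illustrative assumptions rather than the standard tag set's actual labels.

```python
# Hedged sketch of the POS feedback step: tokens found in the closed Tibetan
# numeral / punctuation sets get their CRF tag overridden.
NUMERALS = {"གསུམ་", "༡༣", "ཞེ་གཅིག་", "ས་ཡ་"}       # from the knowledge base above
PUNCTUATION = {"༄༅།།", "།།", "།"}

def pos_feedback(tokens, crf_tags):
    revised = []
    for token, tag in zip(tokens, crf_tags):
        if token in NUMERALS:
            revised.append("m")        # numeral tag (assumed name)
        elif token in PUNCTUATION:
            revised.append("x")        # punctuation tag (assumed name)
        else:
            revised.append(tag)        # keep the CRF prediction
    return revised

print(pos_feedback(["གསུམ་", "།"], ["n", "n"]))   # -> ["m", "x"]
```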
We used 155,335 sentences for training and 2,120 sentences for testing in the Tibetan POS tagging experiments. The results are listed in Table 6.
Table 6 Experimental Performance Comparison
System              Precision   Training data (sentences)   Test data (sentences)
Huaquecairang [13]  98.26%      8,240                       573
Yangjin [10]        91.00%      12,690                      345
Kang [14]           87.76%      13,333                      400
TIP-LAS [15]        93.90%      141,333                     4,240
CRFs model          99.67%      155,335                     2,120
Feedback method     98.75%      155,335                     2,120
From Table 6 we can see that our feature template performs well and the knowledge feedback method improves the precision. Compared with the other systems, our system achieves better performance and uses larger training and test data.
4. Conclusion
Combining statistical methods with Tibetan linguistic features, we proposed a knowledge feedback method to improve the performance of the CRF models. In open set tests with 16,600 sentences for the Tibetan word segmentation model and 206,100 sentences for the Tibetan POS tagging model, the method achieved good performance: the precision of the Tibetan word segmentation model and the POS tagging model is 96.03% and 98.75% respectively, which basically meets practical requirements.
Acknowledgment
This work is supported by the National Natural Science Foundation of China (Grant No. 61331013).
References
[1] Yangmo Droma, "Study on method of solving ambiguity in Tibetan part of speech tagging," Computer Engineering and Applications, vol. 24, pp. 135-148, 2013.
[2] D. Jiang, Y. H. Dong, "Research on Property of Tibetan Characters as Information Processing," Journal of Chinese Information Processing, vol. 02, pp. 37-44, 1999.
[3] B. F. Luo, D. Jiang, "Basic Rule of Tibetan Computer Automatic Word Segmentation," Chinese Minority Language Modernization, 1999.
[4] Y. Z. Chen, B. L. Li, S. W. Yu, "The Design and Implementation of a Tibetan Word Segmentation System," Journal of Chinese Information Processing, pp. 15-20, 2003.
[5] K. Y. Qi, "Research of Tibetan Word Segmentation for Information Processing," Journal of Northwest University for Nationalities (Philosophy and Social Science), vol. 04, pp. 92-97, 2006.
[6] Cai Zhijie, "Design and Implementation of Banzhida Tibetan Word Segmentation System," Journal of Minorities Teachers College of Qinghai Teachers University, vol. 02, pp. 75-77, 2010.
[7] Y. Sun, X. D. Yan, X. B. Zhao, et al., "A Resolution of Overlapping Ambiguity in Tibetan Word Segmentation," 3rd International Conference on Computer Science and Information Technology, pp. 222-225, 2010.
[8] S. Norbu, P. Choejey, T. Dendup, et al., "Dzongkha Word Segmentation," Proceedings of the 8th Workshop on Asian Language Resources, pp. 95-102, 2010.
[9] J. F. Su, K. Y. Qi, Ben Tai, "Research of Part of Speech Tagging for Tibetan Texts Based on Hidden Markov Model," Journal of Northwest University for Nationalities (Philosophy and Social Science), vol. 01, pp. 42-45, 2009.
[10] X. D. Shi, Y. J. Lu, "A Tibetan Segmentation System - Yangjin," Journal of Chinese Information Processing, vol. 25, pp. 54-56, 2011.
[11] C. J. Long, H. D. Liu, Nuo Minghua, J. Wu, "Tibetan POS Tagging Based on Syllable Tagging," Journal of Chinese Information Processing, vol. 28, pp. 211-215, 2015.
[12] Standardized Set of Tibetan POS Markers for Computational Uses, State Language Work Committee, 2015.
[13] M. Sun, H. Q. Cairang, J. Caizhi, W. B. Jiang, Y. J. Lv, Q. Liu, "Tibetan Word Segmentation Based on Discriminative Classification and Reranking," Journal of Chinese Information Processing, vol. 28, pp. 61-65, 2014.
[14] C. J. Kang, "Tibetan Word Segmentation and POS Tagging Research," Shanghai Normal University, 2014.
[15] Y. C. Li, J. Jiang, Y. J. Jia, H. Z. Yu, "TIP-LAS: An Open Source Toolkit for Tibetan Word Segmentation and POS Tagging," Journal of Chinese Information Processing, vol. 29, pp. 203-207, 2015.