Chinese Word Segmentation based on Maximum Matching and Word Binding Force

Pak-kwong Wong and Chorkin Chan
Department of Computer Science
The University of Hong Kong
Pokfulam Road
Hong Kong
pkwong@cs.hku.hk and cchan@cs.hku.hk
Abstract

A Chinese word segmentation algorithm based on forward maximum matching and word binding force is proposed in this paper. This algorithm plays a key role in post-processing the output of a character or speech recognizer in determining the proper word sequence corresponding to an input line of character images or a speech waveform. To support this algorithm, a text corpus of over 63 million characters is employed to enrich an 80,000-word lexicon in terms of its word entries and word binding forces. As it stands now, given an input line of text, the word segmentor can process on average 210,000 characters per second when running on an IBM RISC System/6000 3BT workstation with a correct word identification rate of 99.74%.

1 Introduction

A language model as a post-processor is essential to a recognizer of speech or characters in order to determine the appropriate word sequence and hence the semantics of an input line of text or utterance. It is well known that an N-gram statistics language model is just as effective as, but much more efficient than, a syntactic/semantic analyser in determining the correct word sequence. A necessary condition for successful collection of N-gram statistics is the existence of a comprehensive lexicon and a large text corpus. The latter must be lexically analysed in order to identify all the words, from which N-gram statistics can be derived.

About 5,000 characters are being used in modern Chinese and they are the building blocks of all words. Almost every character is a word and most words are one or two characters long, but there are also abundant words longer than two characters. Before it is segmented into words, a line of text is just a sequence of characters and there are numerous word segmentation alternatives. Usually, all but one of these alternatives are syntactically and/or semantically incorrect. This is the case because, unlike texts in English, Chinese texts have no word markers. A first step towards building a language model based on N-gram statistics is to develop an efficient lexical analyser to identify all the words in the corpus.

Word segmentation algorithms belong to one of two types in general, viz., the structural (Wang et al., 1991) and the statistical type (Lua, 1990) (Lua and Gan, 1994) (Sproat and Shih, 1990) respectively. A structural algorithm resolves segmentation ambiguities by examining the structural relationships between words, while a statistical algorithm compares the usage frequencies of the words and their ordered combinations instead. Both approaches have serious limitations.

2 Maximum Matching Method for Segmentation
Maximum matching (Liu et al., 1994) is one of the most popular structural segmentation algorithms for Chinese texts. This method favours long words and is a greedy algorithm by design, hence suboptimal. Segmentation may start from either end of the line without any difference in segmentation results. In this paper, the forward direction is adopted. The major advantage of maximum matching is its efficiency, while its segmentation accuracy can be expected to lie around 95%.
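The greedy strategy can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the lexicon is a plain set, the maximum word length is a parameter, and uppercase letters stand in for Chinese characters.

```python
# Forward maximum matching: at each position, take the longest lexicon
# word that starts there; fall back to a single character if none does.
def forward_max_match(text, lexicon, max_len=4):
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)  # greedy: longest match wins
                i += length
                break
    return words

# Toy example: uppercase letters stand in for Chinese characters.
print(forward_max_match("ABCD", {"AB", "ABC", "BC", "CD"}))  # ['ABC', 'D']
```

Note how the greedy choice of "ABC" blocks the alternative "AB" + "CD"; this is exactly the kind of error the word binding force of Section 5 is meant to repair.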
3 Word Frequency Method for Segmentation
In this statistical approach in terms of word frequencies, a lexicon needs not only a rich repertoire of word entries, but also the usage frequency of each word. To segment a line of text, each possible segmentation alternative is evaluated according to the product of the word frequencies of the words segmented. The word sequence with the highest frequency product is accepted as correct. This method is simple but its accuracy depends heavily on the accuracy of the usage frequencies. The usage frequency of a word differs greatly from one type of documents to another, say, a passage of world news as against a technical report. Since there are tens of thousands of words actively used, one needs a gigantic collection of texts to make an accurate estimate, but by then, the estimate is just an average and it may not be suitable for any type of document at all. In other words, the variance of such an estimate is too great, making the estimate useless.
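The frequency-product evaluation can be sketched as a small dynamic program in Python. The words and frequencies below are invented for illustration; only segmentations consisting entirely of lexicon words are scored.

```python
# Score every segmentation by the product of its word frequencies and
# keep, at each end position, the best-scoring word sequence so far.
def best_segmentation(text, freq):
    best = {0: (1.0, [])}  # end position -> (product, words)
    for i in range(1, len(text) + 1):
        for j in range(i):
            word = text[j:i]
            if word in freq and j in best:
                score = best[j][0] * freq[word]
                if i not in best or score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best.get(len(text))  # None if no full segmentation exists

# Illustrative frequencies, not from any real corpus.
freq = {"A": 0.5, "B": 0.4, "AB": 0.3, "C": 0.2, "BC": 0.6}
print(best_segmentation("ABC", freq)[1])  # ['A', 'BC']
```

Here "A" + "BC" (product 0.30) beats both "AB" + "C" (0.06) and "A" + "B" + "C" (0.04), which shows why the method is only as good as the frequency estimates it multiplies.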
4 The Lexicon
Most Chinese linguists accept the definition of a word as the minimum unit that is semantically complete and can be put together as building blocks to form a sentence. However, in Chinese, words can be united to form compound words, and they in turn can combine further to form yet higher ordered compound words. As a matter of fact, compound words are extremely common and they exist in large numbers. It is impossible to include all compound words in the lexicon; one can only keep those which are frequently used and have their word components united closely. A lexicon was acquired from the Institute of Information Science, Academia Sinica in Taiwan. There are 78,410 word entries in this lexicon, each associated with a usage frequency. A corpus of over 63 million characters of news lines was acquired from China. Due to cultural differences between the two societies, there are many words encountered in the corpus but not in the lexicon. The latter must therefore be enriched before it can be applied to perform the lexical analysis. The first step towards this end is to merge a lexicon published in China into this one, increasing the number of word entries to 85,855.
5 The Proposed Word Segmentation Algorithm
The proposed algorithm of this paper makes use of a forward maximum matching strategy to identify words. In this respect, this algorithm is a structural approach. Under this strategy, errors are usually associated with single-character words. If the first character of a line is identified as a single-character word, what it means is that there is no multi-character word entry in the lexicon that starts with such a character. In that case, there is not much one can do about it. On the other hand, when a character is identified as a single-character word β following another word α in the line, one cannot help wondering whether the sole character composing β should not be combined with the suffix of α to form another word instead, even if that means changing α into a shorter word. In that case, every possible word sequence alternative corresponding to the sub-sequence of characters from α and β together will be evaluated according to the product of its constituent word binding forces.
The binding force of a word is a measure of how strongly the characters composing the word are bound together as a single unit. This force is often equated to the usage frequency of the word. In this respect, the proposed algorithm is a statistical approach. It is as efficient as the maximum matching method because word binding forces are utilized only in exceptional cases. However, much of the word ambiguity is eliminated, leading to a very high word identification accuracy. Segmentation errors associated with multi-character words can be reduced by adding or deleting words to or from the lexicon as well as adjusting word binding forces.
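Under illustrative assumptions (toy strings, invented binding-force values equated to frequencies, and a plain dictionary in place of the bin tables of Section 6), the control flow just described might be sketched as follows:

```python
# Forward maximum matching, with binding-force re-evaluation triggered
# only in the exceptional case: a single-character word found right
# after a multi-character word.
def max_match(text, lex, max_len=4):
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            w = text[i:i + n]
            if n == 1 or w in lex:
                out.append(w)
                i += n
                break
    return out

def best_by_force(span, force):
    """Best segmentation of `span` by product of word binding forces."""
    best = {0: (1.0, [])}
    for i in range(1, len(span) + 1):
        for j in range(i):
            w = span[j:i]
            if w in force and j in best:
                s = best[j][0] * force[w]
                if i not in best or s > best[i][0]:
                    best[i] = (s, best[j][1] + [w])
    return best.get(len(span), (0.0, None))[1]

def segment(text, force):
    out = []
    for w in max_match(text, force):
        # Exceptional case: single-character word after a longer word
        # triggers re-evaluation of the combined span.
        if len(w) == 1 and out and len(out[-1]) > 1:
            better = best_by_force(out[-1] + w, force)
            if better:
                out[-1:] = better
                continue
        out.append(w)
    return out

# Illustrative binding forces, equated to frequencies as in the paper.
force = {"ABC": 0.1, "AB": 0.5, "CD": 0.4, "D": 0.2, "C": 0.1}
print(segment("ABCD", force))  # ['AB', 'CD']
```

In this toy run, plain maximum matching would return ['ABC', 'D']; the binding-force product 0.5 × 0.4 for "AB" + "CD" beats 0.1 × 0.2 for "ABC" + "D", so the combined span is resegmented.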
6 Structure of the Lexicon
Words in the lexicon are divided into 5 groups according to word lengths. They correspond to words of 1, 2, 3, 4, and more than 4 characters, with group sizes equal to 7025, 53532, 12939, 11269, and 1090 respectively. Since most of the time spent in analyzing a line of text is in finding a match among the lexicon entries, a clever organization of the lexicon speeds up the searching process tremendously. Most Chinese words are of one or two characters only. Searching for longer words before shorter ones as practised in maximum matching means spending a great deal of time searching for non-existent targets. To overcome this problem, the following measures are taken to organize the lexicon for fast search:

• All single-character words are stored in a table of 32768 bins. Since the internal code of a character takes 2 bytes, bits 1-15 are used as the bin address for the word.

• All 2-character words are stored in a separate table of 65536 bins. The two low order bytes of the two characters are used as a short integer for the bin address. Should there be other words contesting for the same bin, they are kept in a linked list.

• Any 3-character word is split into a 2-character prefix and a 1-character suffix. The prefix will be stored in the bin table for 2-character words with clear indication of its prefix status. The suffix will be stored in the bin table for 1-character words, again, with clear indication of its suffix status. All duplicate entries are combined, i.e., if α is a word as well as a suffix, the two entries are combined into one with an indication that it can serve as a word as well as a suffix.

• Any 4-character word is divided up into a 2-character prefix and a 2-character suffix, both stored in the bin table for 2-character words, with clear indications of their respective status. Each prefix points to a linked list of associated suffixes.
• Any word longer than 4 characters will be divided into a 2-character prefix, a 2-character
infix and a suffix. The prefix and tile infix are
stored in the bin table for 2-character words,
with clear indications of their status. Each
prefix points to a linked list of associated infixes and each infix in turn, points to a linked
list of associated suffixes.
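A sketch of the 1- and 2-character tables under simplified assumptions: Python's ord() stands in for the 2-byte internal character code, a Python list stands in for the linked list, and only words up to 3 characters are handled (4-character and longer words extend the same scheme with suffix and infix lists hanging off the 2-character bins).

```python
ONE_CHAR_BINS = 1 << 15   # addressed by 15 bits of the 2-byte code
TWO_CHAR_BINS = 1 << 16   # addressed by a 16-bit short

one_char = [None] * ONE_CHAR_BINS   # single-character word slots
two_char = [None] * TWO_CHAR_BINS   # lists of colliding 2-char entries

def bin1(ch):
    # bin address of a single character (ord() simulates the 2-byte code)
    return (ord(ch) & 0xFFFF) % ONE_CHAR_BINS

def bin2(pair):
    # one low-order byte from each character packed into a 16-bit short
    return ((ord(pair[0]) & 0xFF) << 8) | (ord(pair[1]) & 0xFF)

def add(word):
    if len(word) == 1:
        slot = one_char[bin1(word)] or {"is_word": False, "suffix_of": set()}
        slot["is_word"] = True
        one_char[bin1(word)] = slot
    elif len(word) == 2:
        entries = two_char[bin2(word)] or []
        hit = next((e for e in entries if e["pair"] == word), None)
        if hit:
            hit["is_word"] = True
        else:
            entries.append({"pair": word, "is_word": True, "is_prefix": False})
        two_char[bin2(word)] = entries
    elif len(word) == 3:
        prefix, suffix = word[:2], word[2]
        # mark (or create) the 2-character entry as a prefix
        entries = two_char[bin2(prefix)] or []
        hit = next((e for e in entries if e["pair"] == prefix), None)
        if hit:
            hit["is_prefix"] = True
        else:
            entries.append({"pair": prefix, "is_word": False, "is_prefix": True})
        two_char[bin2(prefix)] = entries
        # store the suffix in the 1-character table, combining duplicates
        slot = one_char[bin1(suffix)] or {"is_word": False, "suffix_of": set()}
        slot["suffix_of"].add(prefix)
        one_char[bin1(suffix)] = slot

for w in ["C", "AB", "ABC"]:
    add(w)
print(one_char[bin1("C")]["suffix_of"])  # {'AB'}
```

After the three insertions, the single entry for "AB" is flagged both as a word and as a prefix, and the entry for "C" records that it can serve as a word as well as a suffix of "AB", mirroring the duplicate-combining rule above.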
Maximum matching segmentation of a sequence of characters "...abcdefghij..." at the character "a" starts with matching "ab" against the 2-character words table. If no match is found, then "a" is assumed a 1-character word and maximum matching moves on to "b". If a match is found, then "ab" is investigated to see if it can be a prefix. If it cannot, then "ab" is a 2-character word and maximum matching moves on to "c". If it can, then one examines if it can be associated with an infix. If it can, then one examines if "cd" can be an infix associated with "ab". If the answer is negative, then the possibility of "abcd" being a word is considered. If that fails again, then "c" in the table of 1-character words is examined to see if it can be a suffix. If it can, then "abc" will be examined to see if it can be a word by searching the 1-character suffix linked list pointed at by "ab". Otherwise, one has to accept that "ab" is a 2-character word and moves on to start matching at "c". If "cd" can be an infix preceded by "ab", the linked list pointed at by "cd" as an infix will be searched for the longest possible suffix to combine with "abcd" as its prefix. If no match can be found, then one has to give up "cd" as an infix to "ab".
7 Training of the System

Despite the fact that the lexicon acquired from Taiwan has been augmented with words from another lexicon developed in China, when it is applied to segment 1.2 million characters of news passages in blocks of 10,000 characters each, randomly selected over the text corpus, an average word segmentation error rate (μ) of 2.51% was found with a standard deviation (σ) of 0.57%, mostly caused by uncommon words not included in the enriched lexicon. It was then decided that the lexicon should be further enriched with new words and adjusted word binding forces over a number of generations. In generation i, n new blocks of text are picked randomly from the corpus and words segmented using the lexicon enriched in the previous generation. This process will stop when μ levels off over several generations. The 100(1-a)% confidence interval of μ in generation i is ±t_{0.5a,n-1}·σ/√n, where σ is the standard deviation of error rates in generation i-1, and n is the number of blocks to be segmented in generation i. t_{0.5a,n-1} is the upper 0.5a critical value of the t distribution with n-1 degrees of freedom (Devore, 1991). Throughout the experiments below, n is always chosen to be 20 so that the 90% confidence interval (i.e., a = 0.1) of μ is about ±0.23%.
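As a check on the arithmetic (the critical value t_{0.05,19} ≈ 1.729 below is taken from a standard t table, not from the paper):

```python
import math

sigma = 0.57          # standard deviation of block error rates, in %
n = 20                # blocks segmented per generation
t_crit = 1.729        # upper 0.05 critical value of t with 19 d.f.

half_width = t_crit * sigma / math.sqrt(n)
print(round(half_width, 2))  # 0.22, in line with the paper's ~±0.23%
```

The small difference from ±0.23% is plausibly rounding in the reported σ.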
8 Experimental Results

The lexicon has been updated over six generations after being applied to word segment 1.2 million characters. The vocabulary increases from 85,855 words to 87,326 words. The segmentation error rates over seven generations of the training process are shown in the table below:
Lexicon        Error Rate μ over a text of 200,000 Characters
Generation     Max. Mat.      Max. Mat. & Word Bind. Force
Number
0              5.71%          2.32%
1              5.20%          2.16%
2              4.66%          1.88%
3              4.58%          1.69%
4              2.60%          0.43%
5              2.47%          0.30%
6              2.44%          0.26%
Most of these errors occur in proper nouns not included in the lexicon. They are hard to avoid unless they become popular enough to be added to the lexicon. The CPU time used for segmenting a text of 1,200,000 characters is 5.7 seconds on an IBM RISC System/6000 3BT computer.
9 Conclusion

Lexical analysis is a basic process of analyzing and understanding a language. The proposed algorithm provides a highly accurate and highly efficient way for word segmentation of Chinese texts. Due to cultural differences, the same language used in different geographical regions and different applications can be quite different, causing problems in lexical analysis. However, by introducing new words into the lexicon and adjusting the word binding forces in it, such difficulties can be greatly mitigated.

This word segmentor will be applied to word segment the entire corpus of 63 million characters before N-gram statistics are collected for post-processing recognizer outputs.
References

Jay L. Devore. 1991. Probability and Statistics for Engineering and the Sciences. Duxbury Press, pages 272-276.

Yuan Liu, Qiang Tan, and Kun Xu Shen. 1994. The Word Segmentation Rules and Automatic Word Segmentation Methods for Chinese Information Processing (in Chinese). Qing Hua University Press and Guang Xi Science and Technology Press, page 36.

Kim-Teng Lua and Kok-Wee Gan. 1994. An Application of Information Theory in Chinese Word Segmentation. Computer Processing of Chinese and Oriental Languages, Vol. 8, No. 1, pages 115-123, June.

K.T. Lua. 1990. From Character to Word — An Application of Information Theory. Computer Processing of Chinese and Oriental Languages, Vol. 4, No. 4, pages 304-313, March.

Liang-Jyh Wang, Tzusheng Pei, Wei-Chuan Li, and Lih-Ching R. Huang. 1991. A Parsing Method for Identifying Words in Mandarin Chinese Sentences. In Proceedings of 12th International Joint Conference on Artificial Intelligence, pages 1018-1023, Darling Harbour, Sydney, Australia, 24-30 August.

Richard Sproat and Chilin Shih. 1990. A Statistical Method for Finding Word Boundaries in Chinese Text. Computer Processing of Chinese and Oriental Languages, Vol. 4, No. 4, pages 336-349, March.