Information Systems
ITA
●Lu Yong (Dept. of Lib. & Info. Sci., Nanjing Agri. Univ., Jiangsu, 210095; President Office, Nanjing Coll. of Info. Sci. & Tech., Jiangsu, 210044)
Hou Hanqing (Dept. of Lib. & Info. Sci., Nanjing Agri. Univ., Jiangsu, 210095)
Automatic Recognition and Mining of Chinese Synonyms for Information Retrieval
Abstract: This paper presents two methods for mining synonyms automatically. The first is a PageRank algorithm based on dictionary definitions: we analyze the links between a given word and other words, construct an associated word graph, and then use PageRank to calculate the degree of similarity and discover synonyms in that graph. The second is a pattern matching algorithm based on the patterns of dictionary definitions: mining rules are written manually, and the system then mines synonyms automatically by pattern matching. In addition, we use the pattern matching algorithm to mine synonyms from the web and from the text of periodical articles in economics. Mining experiments on financial dictionaries show that the precision of the PageRank algorithm and of the pattern matching algorithm reaches 85.6% and 90% respectively. The test results indicate that the system is feasible and practical.
Keywords: Chinese synonyms; automatic recognition; automatic mining; pattern matching; PageRank algorithm
Information Studies: Theory & Application (情报理论与实践), Vol. 29, No. 4, 2006

1 Introduction

Synonym recognition and mining have many applications in natural language processing. In information retrieval, for example, synonyms can be used to broaden and modify natural language queries, improving the recall of search results. They can also support the compilation of post-controlled vocabularies. In databases, synonym recognition and mining can convert natural vocabulary into controlled vocabulary, helping users construct search queries.

2 Related Work

Outside China, there has been no research devoted specifically to the automatic recognition and mining of synonyms. At present, the related research includes the edit-distance approach, the statistical approach and the WordNet distance approach.

2.1 Edit-distance approach

Synonyms often have similar character forms. The edit-distance approach is widely used to determine string similarity and to calculate word similarity scores. The edit distance is the minimum number of insertions, deletions and substitutions required to transform one string into the other.

2.2 Statistical approach

The context of a word can be used to describe the similarity of meaning between words. Usually, similar words tend to have the same contexts, or tend to occur near each other. Richardson [1994] used the cosine function to compute similarity in the word vector space. Turney [2001] used the results returned by the AltaVista search engine with its NEAR operator to estimate the frequency of words occurring near one another.

2.3 WordNet distance approach

WordNet organizes nouns and verbs into hierarchical trees. Each concept in WordNet consists of a list of synonymous word forms, and semantic pointers describe the relationships between the current concept and other concepts. The measurement of similarity or semantic distance is based on the structure and content of WordNet. The semantic distance between two words is determined by the length of the path connecting them in WordNet. The greater the semantic distance between two words in the WordNet hierarchy, the lower their semantic similarity.

These studies were concerned with mining similar words and co-occurring words, not synonyms.

In China, many approaches to the automatic recognition and mining of Chinese synonyms have been proposed, for instance the literal similarity-based algorithm, the word unit semantic dictionary-based algorithm and the semantic thesaurus-based algorithm. The literal similarity-based algorithm exploits the principle of literal similarity among Chinese synonyms. Later, the literal similarity algorithm was extended to weight the function of each word element in expressing
the meaning of a word. In the word unit semantic dictionary-based algorithm, compound words (phrases) are cut into word units using the word unit semantic dictionary and transformed into semantic codes, and their semantic similarity is then compared. The semantic thesaurus-based algorithm uses the structure of the semantic thesaurus, computing the length of the path between words to determine their similarity. A number of Chinese semantic thesaurus resources are now available, such as HowNet and Tong Yi Ci Ci Lin.

In order to improve the performance of synonym recognition and mining, this paper puts forward two new dictionary-based methods: a pattern matching method and a PageRank method.

3 Pattern Matching Method

In a dictionary, nominal definitions can be divided into the synonym definition mode, the antonym definition mode and the example-and-division definition mode. The synonym definition mode uses a word of similar or identical meaning to define a given word. The antonym definition mode uses a word of opposite meaning to define a given word, and the example-and-division definition mode uses narrower words to define a word.

In information retrieval, the synonym relationship in the broad sense includes the equivalence relationship (synonyms and antonyms) and the hierarchical relationship (whole-part relationship).

Each definition mode has indicator words that signal the relationship between words. For example, the synonym definition mode uses words such as "简称" ("abbreviated as") and "也称" ("also called"). Using these indicator words, we define extraction rules and apply them to automatically extract words that stand in a definite relationship. The extracted words are synonyms in the broad sense. Some indicator words of each definition mode are listed in Table 1.

Table 1 Some indicator words of each definition mode in the dictionary

Definition mode                        Indicator words
Synonym definition mode                简称; 又称; 又称为; 亦称; 参见; 见; …的简称
Antonym definition mode                …的对称; 相对于…而言
Example-and-division definition mode   主要形式有…; 包括…; 分为…

In addition, we define some exclusion rules to improve the results of rule matching. These rules eliminate words that match the extraction rules but are not actually synonyms. Each rule consists of prefix characters and postfix characters (see Figure 1).

<Prefix> synonym <Postfix>
Figure 1: The form of a rule

For example:

<Prefix> 简称|又称|亦称
word
<Postfix> punctuation

This extraction rule states that if the content immediately to the left of a word is 简称, 又称 or 亦称, and the content immediately to its right is punctuation (such as a comma or full stop), then the word can be extracted as a synonym.

Traditionally, the synonym relationship (equivalence relationship and hierarchical relationship) is a binary relation between two words. It is assumed to be symmetric and transitive. Using these properties, we can infer new synonyms. For example:

If X s Y and Y s Z, then X s Z

That is, if words X and Y are synonyms, and Y and Z are synonyms, then X and Z are synonyms.

4 PageRank Method

The PageRank algorithm based on dictionary definitions rests on the assumption that synonyms have many words in common in their definitions and are used in the definitions of many of the same words. We analyze the links between a given word and the other words and construct the associated word graph. Finally, we convert the associated word graph into an adjacency matrix and use the PageRank algorithm to calculate the degree of similarity and discover the synonyms in the associated word graph.

The overall process consists of the following steps (see Figure 2 and Figure 3):

Step 1: Use the traditional dictionary-based method (Maximum Matching Method) to segment the definition of the word. Each Chinese word is matched against the dictionary held in memory by binary search to save time.

Step 2: Analyze the links between a given word and the other words, then construct the associated word graph. Each associated word is a vertex of the graph, and there is an edge from u to v if u appears in the definition of v. The associated word graph includes the words which point to the given word or are pointed to by the given word.
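
Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system built for this paper: the function names, the toy lexicon and the two toy definitions are hypothetical, forward maximum matching is assumed as the variant of the Maximum Matching Method, and the 7-character matching window is an assumption as well.

```python
from bisect import bisect_left


def in_lexicon(sorted_lexicon, word):
    """Binary search for `word` in a sorted list of dictionary entries."""
    i = bisect_left(sorted_lexicon, word)
    return i < len(sorted_lexicon) and sorted_lexicon[i] == word


def segment(text, sorted_lexicon, max_len=7):
    """Forward maximum matching: take the longest lexicon entry at each position.

    max_len is an assumed window size, not a value given in the paper.
    """
    words, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # An unmatched single character is emitted as its own token.
            if size == 1 or in_lexicon(sorted_lexicon, candidate):
                words.append(candidate)
                pos += size
                break
    return words


def build_word_graph(definitions, sorted_lexicon):
    """Return directed edges (u, v) meaning 'u appears in the definition of v'."""
    edges = set()
    for headword, definition in definitions.items():
        for token in segment(definition, sorted_lexicon):
            if token != headword and in_lexicon(sorted_lexicon, token):
                edges.add((token, headword))
    return edges


def associated_subgraph(edges, given):
    """Keep `given` plus the words that point to it or are pointed to by it."""
    vertices = {given}
    for u, v in edges:
        if v == given:
            vertices.add(u)
        if u == given:
            vertices.add(v)
    kept = {(u, v) for u, v in edges if u in vertices and v in vertices}
    return vertices, kept


# Toy data invented for illustration (not taken from the financial dictionaries).
lexicon = sorted(["外汇", "市场", "供求", "关系", "汇率", "市场汇率", "浮动汇率"])
definitions = {
    "市场汇率": "外汇市场供求关系决定的汇率",
    "浮动汇率": "随市场汇率变动的汇率",
}
edges = build_word_graph(definitions, lexicon)
vertices, sub_edges = associated_subgraph(edges, "市场汇率")
print(sorted(vertices))
```

Keeping the lexicon as a sorted list lets `bisect_left` do the binary-search lookup mentioned in Step 1. A hash set would also work, but would not illustrate that step.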
Figure 2 Graph of the word "市场汇率"

0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 1 0 0 0 0
1 1 1 0 0 0 0 0
1 1 1 0 0 0 0 0
Figure 3 Adjacency matrix of the word "市场汇率"

Step 3: Convert the graph into an adjacency matrix and apply the PageRank formula to compute the PageRank score of each word in the word graph (see Table 2).

$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$ (PageRank formula)

Here PR(x) is the PageRank of x, C(x) is the number of outbound links of word x, T_1, ..., T_n are the words that link to A, and d is a damping factor set between 0 and 1.

Table 2 PageRank score of each word in the graph

Header words: 外汇 供求 外汇汇率 汇率 市场汇率 自由兑换 浮动汇率 均衡汇率 市场 关系 管制

Iteration   Score per word
initial     1     1     1     1          1           1           1          1
1           0.85  0.85  0.85  0.9875     0.925       0.9625      0.9875     0.9875
2           0.85  0.85  0.85  0.966875   0.9240625   0.9559375   0.966875   0.966875
3           0.85  0.85  0.85  0.966875   0.9240625   0.9559375   0.966875   0.966875
4           0.85  0.85  0.85  0.966875   0.9240625   0.9559375   0.966875   0.966875

Step 4: Set threshold criteria to select the most similar words as synonyms. The criteria include the PageRank score, the number of output synonyms and literal similarity.

5 Results

We built the synonym mining system on the Visual Basic .NET platform. To evaluate the performance of the system, we examined the results it produced. Mining experiments on financial dictionaries show that the precision of the PageRank algorithm and of the pattern matching algorithm reaches 85.6% and 90% respectively. The test results indicate that the system is feasible and practical (see Table 3 and Table 4).

Table 3 Pattern matching extraction result in the definitions of 700 words

Extract rule   Correct words extracted   Error words extracted   Words not extracted   Precision   Recall
Synonyms       143                       5                       25                    96.6%       82.6%
Antonyms       24                        0                       4                     100%        85.7%
Narrow word    17                        2                       28                    89.5%       36%

Table 4 PageRank extraction result in the definitions of 800 words

Word length             Words extracted   Correct words extracted   Proportion
Two-character words     12                10                        83.3%
Three-character words   41                32                        78%
Four-character words    326               291                       89.3%
Five-character words    140               106                       75.7%
Six-character words     186               158                       84.9%
Seven-character words   120               109                       90.8%
Total                   825               706                       85.6%

The experimental system still needs further improvement, for example in the methods of pattern gathering, the construction of the word-derived dictionary, and the size and scope of the test set.

References
1 Higgins D. Which Statistics Reflect Semantics? Rethinking Synonymy and Word Similarity. International Conference on Linguistic Evidence, Tübingen, Germany, 2004
2 Ristad E S, Yianilos P N. Learning String Edit Distance. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998 (5)
3 Edmundson H P. Axiomatic Characterization of Synonymy and Antonymy. International Conference on Computational Linguistics, Grenoble, 1967
4 Edmundson H P. Computer-aided Research on Synonymy and Antonymy. International Conference on Computational Linguistics, Stockholm, 1969
5 Turney P D. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. The European Conference on Machine Learning (ECML2001), Freiburg, Germany, 2001
6 Senellart P P. Extraction of Information in Large Graphs. Automatic Search for Synonyms. Technical Report 90, 2001
7 Senellart P P, Blondel V D. Automatic Discovery of Similar Words. In: Survey of Text Mining. New York: Springer-Verlag, 2003
8 Richardson R, Smeaton A F, Murphy J. Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words. In: Proceedings of AICS Conference. Dublin: Trinity College, 1994
9 Blondel V D, Senellart P P. Automatic Extraction of Synonyms in a Dictionary. In: Proceedings of the SIAM Text Mining Workshop, Arlington, VA: [s. n.], 2002
10 贾爱平. 科技文献中术语定义的语言模式研究: [学位论文]. 北京: 北京语言文化大学, 2002
11 陆勇, 侯汉清. 用于信息检索的同义词自动识别及其进展. 南京农业大学学报 (社会科学版), 2004 (3)
12 陆勇, 侯汉清. 基于词典注释的汉语同义词自动识别. 见: 第一届全国信息检索与内容安全学术会议论文集. 上海: [出版者不详], 2004
13 宋明亮. 汉语词汇字面相似性原理与后控制词表动态维护研究. 情报学报, 1996 (4)
14 吴志强. 经济信息后控制词表的研究: [学位论文]. 南京: 南京农业大学, 1999
15 许勇. 基于百科词典的知识获取系统的研究与实现. 北京: 北京工业大学, 2001
16 章成志. 基于文本层次模型的 Web 概念挖掘研究——基于概念语义网络的自动标引和自动分类研究: [学位论文]. 南京: 南京农业大学, 2002
17 朱毅华. 智能搜索引擎中的同义词识别算法研究: [学位论文]. 南京: 南京农业大学, 2001

Brief Introduction to Authors:
Lu Yong, born in 1979, master's degree. Major research areas: intelligent information processing.
Hou Hanqing, male, born in 1943, professor, doctoral supervisor. Major research areas: intelligent information processing, information retrieval.
Received on Mar. 9, 2006
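Automatic Recognition and Mining of Chinese Synonyms for Information Retrieval (面向信息检索的汉语同义词自动识别和挖掘)
Lu Yong (Department of Information Management, Nanjing Agricultural University, Jiangsu 210095; President's Office, Nanjing University of Information Science and Technology, Jiangsu 210044)
Hou Hanqing (Department of Information Management, Nanjing Agricultural University, Jiangsu 210095)
Abstract: To improve the efficiency of automatic synonym mining, this paper proposes methods for automatically recognizing and mining synonyms from dictionary definitions, using a hyperlink analysis algorithm and a pattern matching algorithm to extract synonyms from different angles. The first part treats the relationship between a defining word and the word it defines as a link. A given word is analyzed, and all related words that have a link relationship with it are assembled into a word graph, in which each node represents a related word and each arc represents a defining/defined relationship between words. Using hyperlink analysis together with the PageRank algorithm, the PageRank value of each word is computed and taken as a measure of the semantic similarity between words. Finally, a candidate synonym set is generated for each word, and the best synonyms are recommended through certain screening principles and methods. The second part uses word definition patterns: the ways in which words are defined are analyzed, the patterns in which synonyms appear in dictionary definitions are summarized, and pattern matching is then used to recognize and mine synonyms. In addition, pattern matching was also tested for mining synonyms from web pages and journal articles. The test results show that using pattern matching and hyperlink analysis to recognize and mine synonyms automatically is feasible and practical.
Keywords: Chinese synonyms; automatic recognition; automatic mining; pattern matching; PageRank algorithm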
・ Information Studies: Theory & Application ・