情报理论与实践 (ITA), 信息系统 (Information Systems) section, Vol. 29, No. 4, 2006, pp. 472-475

Automatic Recognition and Mining of Chinese Synonyms for Information Retrieval

Lu Yong (Dept. of Lib. & Info. Sci., Nanjing Agri. Univ., Jiangsu, 210095; President Office, Nanjing Coll. of Info. Sci. & Tech., Jiangsu, 210044)
Hou Hanqing (Dept. of Lib. & Info. Sci., Nanjing Agri. Univ., Jiangsu, 210095)

Abstract: This paper presents two methods for mining synonyms automatically. The first is a PageRank algorithm based on dictionary definitions: we analyze the relation links between a given word and other words, construct an associated word graph, and use the PageRank algorithm to calculate similarity degrees and discover synonyms in that graph. The second is a pattern matching algorithm based on the patterns of dictionary definitions: we write mining rules manually, and the system then mines synonyms automatically by pattern matching. In addition, we use the pattern matching algorithm to mine synonyms from the web and from the text of periodical articles in economics. Mining practice on financial dictionaries shows that the precisions of the PageRank algorithm and the pattern matching algorithm reach 85.6% and 90% respectively. The test results indicate that the system is feasible and practical.

Keywords: Chinese synonyms; automatic recognition; automatic mining; pattern matching; PageRank algorithm

1 Introduction

Synonym recognition and mining have many applications in Natural Language Processing. In information retrieval, for example, synonyms can be used to broaden and modify natural language queries, which improves search recall. They can also support the compilation of post-controlled vocabularies. In databases, synonym recognition and mining technology can convert natural vocabulary to controlled vocabulary, helping users construct search queries.

2 Related Work

Outside China, there has been no research devoted specifically to the automatic recognition and mining of synonyms. At present, the related research includes the edit-distance approach, the statistical approach and the WordNet distance approach.

2.1 Edit-distance approach

Synonyms often have similar character forms. The edit-distance approach is widely used to determine string similarity and to calculate word similarity scores. The edit distance is the minimum number of insertions, deletions and substitutions required to transform one string into the other.

2.2 Statistical approach

The context of a word can be used to describe the similarity of meaning between words. Usually, similar words tend to have the same contexts, or tend to occur near each other. Richardson [1994] used the cosine function to compute similarity in the word vector space. Turney [2001] used the results of the AltaVista search engine and its NEAR operator to estimate the frequency of words occurring near one another.

2.3 WordNet distance approach

WordNet organizes nouns and verbs into hierarchical trees. A concept in WordNet consists of a list of synonymous word forms, and semantic pointers describe the relationships between the current concept and other concepts. The measurement of similarity or semantic distance is based on the structure and content of WordNet. The semantic distance between two words is determined by the length of the path connecting them in WordNet: the greater the semantic distance between two words in the WordNet hierarchy, the lower their semantic similarity.

These studies were concerned with mining similar words and co-occurring words, not synonyms.

In China, many approaches to the automatic recognition and mining of Chinese synonyms have been proposed, for instance the literal similarity-based algorithm, the word unit semantic dictionary-based algorithm and the semantic thesaurus-based algorithm. The literal similarity-based algorithm uses the principle of literal similarity among Chinese synonyms. Later versions of the algorithm weight each word element according to its function in expressing the meaning of the word. In the word unit semantic dictionary-based algorithm, compound words (phrases) are cut into word units using the word unit semantic dictionary and transformed into semantic codes, and their semantic similarity is then compared. The semantic thesaurus-based algorithm uses the structure of a semantic thesaurus and computes the length of the path between words to determine their similarity. A number of Chinese semantic thesaurus resources are now available, such as Hownet and Tong Yi Ci Ci Lin.

In order to improve the performance of synonym recognition and mining, this paper puts forward two new dictionary-based methods: the pattern matching method and the PageRank method.

3 Pattern Matching Method

In a dictionary, the nominal definition modes can be divided into the synonyms definition mode, the antonym definition mode, and the example and divide definition mode. The synonyms definition mode uses a word of similar or identical meaning to define a given word. The antonym definition mode uses a word of opposite meaning to define a given word. The example and divide definition mode uses narrower words to define a word.

In information retrieval, the synonym relationship in the broad sense includes the equivalence relationship (synonyms and antonyms) and the hierarchical relationship (whole-part relationship).

Each definition mode has indicator words that precede the related word and signal the relationship between the words. For example, the synonyms definition mode uses words such as "简称" (abbreviated as) and "也称" (also called). Using these indicator words, we define extraction rules and apply them to extract words with a definite relationship automatically. The extracted words are synonyms in the broad sense. Some indicator words of each definition mode are listed in Table 1.

Table 1 Some indicator words of each mode of definition in dictionary

Definition mode                      Indicator word
Synonyms definition mode             简称; 又称; 又称为; 亦称; 参见; 见; …的简称
Antonyms definition mode             …的对称; 相对于…而言
Example and divide definition mode   主要形式有…; 包括…; 分为…

In addition, we define some exclusive rules to improve the results of the rule matching. These exclusive rules eliminate words that match the extraction rules but are not actually synonyms. Each rule consists of prefix characters and postfix characters (see Figure 1).

    <Prefix> 同义词 <Postfix>

    For example:
    <Prefix> 简称 | 又称 | 亦称    word    <Postfix> punctuation

Figure 1 The form of rule (同义词 = synonym)

This extraction rule states that if the content immediately to the left of a word is 简称, 又称 or 亦称, and the content immediately to its right is punctuation (such as a comma or full stop), then the word can be extracted as a synonym.

Traditionally, the synonym relationship (equivalence relationship and hierarchical relationship) is a binary relation between two words. It is assumed to be symmetric and transitive. Using these properties, we can infer more new synonyms. For example:

    X s Y, Y s Z, then X s Z

That is, if words X and Y are synonyms and words Y and Z are synonyms, then words X and Z are synonyms.

4 PageRank Method

The PageRank algorithm based on dictionary definitions rests on the assumption that synonyms have many words in common in their definitions and are used in the definitions of many common words. We analyze the relation links between a given word and the other words, then construct the associated word graph. Finally, we convert the associated word graph into an adjacency matrix and use the PageRank algorithm to calculate the similarity degree and discover the synonyms in the associated word graph.

The overall process is described in the following steps (see Figure 2 and Figure 3):

Step 1: Use the traditional dictionary-based method (Maximum Matching Method) to segment the definition of the word. Each Chinese word is matched against the in-memory dictionary by binary search to save time.

Step 2: Analyze the relation links between a given word and the other words, then construct the associated word graph. Each associated word is a vertex of the graph, and there is an edge from u to v if u appears in the definition of v. The associated word graph includes the words that point to the given word or are pointed to by the given word.

[Figure 2 Graph of word "市场汇率" (market exchange rate)]

    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    1 1 1 0 0 0 0 0
    0 0 0 1 0 0 0 0
    0 0 0 1 0 0 0 0
    1 1 1 0 0 0 0 0
    1 1 1 0 0 0 0 0

Figure 3 Adjacency matrix of word "市场汇率"

Step 3: Convert the graph into an adjacency matrix, then apply the PageRank formula to compute the PageRank score of each word in the word graph (see Table 2).

$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$

(PageRank formula)

Here PR(x) is the PageRank of x, T_1, ..., T_n are the words that link to A, C(x) is the number of outbound links on word x, and d is a damping factor set between 0 and 1.

Table 2 PageRank score of each word in the graph

Iteration     外汇   供求   外汇汇率   汇率        市场汇率     自由兑换     浮动汇率    均衡汇率
0 (initial)   1      1      1          1           1            1            1           1
1             0.85   0.85   0.85       0.9875      0.925        0.9625       0.9875      0.9875
2             0.85   0.85   0.85       0.966875    0.9240625    0.9559375    0.966875    0.966875
3             0.85   0.85   0.85       0.966875    0.9240625    0.9559375    0.966875    0.966875
4             0.85   0.85   0.85       0.966875    0.9240625    0.9559375    0.966875    0.966875

[Unplaced labels from the source near this table header: 市场, 关系, 管制]

Step 4: Set the threshold criteria to select the most similar words as synonyms. The criteria include the PageRank score, the number of output synonyms and literal similarity.

5 Results

We built the synonym mining system on the Visual Basic .NET platform. To evaluate system performance, we examined the results produced by the system. Mining practice on financial dictionaries shows that the precisions of the PageRank algorithm and the pattern matching algorithm reach 85.6% and 90% respectively. The test results indicate that the system is feasible and practical (see Table 3 and Table 4).

Table 3 Pattern matching extraction result in the definitions of 700 words

Extract rule   Number of correct   Number of error   Number of words   Precision   Recall
               words extracted     words extracted   not extracted
Synonyms       143                 5                 25                96.6%       82.6%
Antonyms       24                  0                 4                 100%        85.7%
Narrow word    17                  2                 28                89.5%       36%

Table 4 PageRank extraction result in the definitions of 800 words

                          Number of words   Number of correct   Proportion
                          extracted         words extracted
Two-character words       12                10                  83.3%
Three-character words     41                32                  78%
Four-character words      326               291                 89.3%
Five-character words      140               106                 75.7%
Six-character words       186               158                 84.9%
Seven-character words     120               109                 90.8%
Total                     825               706                 85.6%

The experimental system still needs further improvement, for example in the methods of pattern gathering, the construction of the word-derived dictionary, and the size and scope of the test set.

Reference

1 Higgins D. Which Statistics Reflect Semantics? Rethinking Synonymy and Word Similarity. International Conference on Linguistic Evidence, Tübingen, Germany, 2004
2 Ristad E S, Yianilos P N. Learning String Edit Distance. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998 (5)
3 Edmundson H P. Axiomatic Characterization of Synonymy and Antonymy. International Conference on Computational Linguistics, Grenoble, 1967
4 Edmundson H P. Computer-aided Research on Synonymy and Antonymy. International Conference on Computational Linguistics, Stockholm, 1969
5 Turney P D. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. The European Conference on Machine Learning (ECML2001), Freiburg, Germany, 2001
6 Senellart P P. Extraction of Information in Large Graphs. Automatic Search for Synonyms. Technical Report 90, 2001
7 Senellart P P, Blondel V D. Automatic Discovery of Similar Words. In: Survey of Text Mining. New York: Springer-Verlag, 2003
8 Richardson R, Smeaton A F, Murphy J. Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words. In: Proceedings of AICS Conference. Dublin: Trinity College, 1994
9 Blondel V D, Senellart P P. Automatic Extraction of Synonyms in a Dictionary. In: Proceedings of the SIAM Text Mining Workshop, Arlington, VA: [s. n.], 2002
10 Jia Aiping (贾爱平). Research on the Linguistic Patterns of Term Definitions in Scientific and Technical Literature: [Dissertation]. Beijing: Beijing Language and Culture University, 2002 (in Chinese)
11 Lu Yong (陆勇), Hou Hanqing (侯汉清). Automatic Recognition of Synonyms for Information Retrieval and Its Progress. 南京农业大学学报 (社会科学版), 2004 (3) (in Chinese)
12 Lu Yong (陆勇), Hou Hanqing (侯汉清). Automatic Recognition of Chinese Synonyms Based on Dictionary Definitions. In: Proceedings of the First National Conference on Information Retrieval and Content Security. Shanghai: [s. n.], 2004 (in Chinese)
13 Song Mingliang (宋明亮). Research on the Principle of Literal Similarity of Chinese Words and the Dynamic Maintenance of Post-controlled Vocabulary. 情报学报, 1996 (4) (in Chinese)
14 Wu Zhiqiang (吴志强). Research on Post-controlled Vocabulary for Economic Information: [Dissertation]. Nanjing: Nanjing Agricultural University, 1999 (in Chinese)
15 Xu Yong (许勇). Research and Implementation of a Knowledge Acquisition System Based on an Encyclopedic Dictionary. Beijing: Beijing University of Technology, 2001 (in Chinese)
16 Zhang Chengzhi (章成志). Research on Web Concept Mining Based on a Text Hierarchy Model: Automatic Indexing and Automatic Classification Based on a Conceptual Semantic Network: [Dissertation]. Nanjing: Nanjing Agricultural University, 2002 (in Chinese)
17 Zhu Yihua (朱毅华). Research on Synonym Recognition Algorithms in Intelligent Search Engines: [Dissertation]. Nanjing: Nanjing Agricultural University, 2001 (in Chinese)

Brief Introduction to Authors:
Lu Yong, born in 1979, master. Major research areas: intelligent information processing.
Hou Hanqing, male, born in 1943, professor, doctoral supervisor. Major research areas: intelligent information processing, information retrieval.

Received on Mar. 9, 2006

Chinese abstract (translated)

Lu Yong (Department of Information Management, Nanjing Agricultural University, Jiangsu 210095; President's Office, Nanjing University of Information Science & Technology, Jiangsu 210044)
Hou Hanqing (Department of Information Management, Nanjing Agricultural University, Jiangsu 210095)

Abstract: To improve the efficiency of automatic synonym mining, this paper proposes methods for automatically recognizing and mining synonyms from dictionary definitions, using a hyperlink analysis algorithm and a pattern matching algorithm to extract synonyms from different angles. The first part treats the relationship between a defining word and a defined word as a link. For a given word, all related words linked to it are assembled into a word graph, in which each node represents a related word and each arc represents a defining/defined relationship between words. Hyperlink analysis combined with the PageRank algorithm is used to calculate the PageRank value of each word, which is taken as a measure of the semantic similarity between words. Finally, a candidate synonym set is generated for each word, and the best synonyms are recommended through certain screening principles and methods. The second part uses word definition patterns: the ways in which words are defined are analyzed, the patterns in which synonyms appear in dictionary definitions are summarized, and pattern matching is then used to recognize and mine synonyms. In addition, pattern matching was also tested on mining synonyms from Web pages and journal articles. The test results show that using pattern matching and hyperlink analysis to recognize and mine synonyms automatically is feasible and practical.
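As an illustration of the extraction rule in Section 3 (Figure 1) and of the symmetric and transitive inference that follows it, here is a minimal Python sketch. It is not the authors' Visual Basic .NET system. The function names `extract_synonyms` and `synonym_groups`, the punctuation set used as the postfix, and the placeholder words 词甲, 词乙, 词丙 are all assumptions made for the example. Only the indicator words come from Table 1.

```python
import re

# Indicator words of the synonyms definition mode (Table 1). The longer
# alternative 又称为 is listed before 又称 so that it is tried first.
PREFIXES = ["又称为", "简称", "又称", "亦称"]

# <Prefix> word <Postfix>: the postfix is punctuation. The exact punctuation
# set below is an assumption, not taken from the paper.
RULE = re.compile(
    "(?:" + "|".join(PREFIXES) + ")"   # left adjacent indicator word
    + "([^，。；、,.;]+)"                # candidate synonym
    + "[，。；、,.;]"                    # right adjacent punctuation
)


def extract_synonyms(definition):
    """Return the words in a definition that match the extraction rule."""
    return RULE.findall(definition)


def synonym_groups(pairs):
    """Group words by treating the pairs as a symmetric, transitive relation."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    groups = {}
    for word in list(parent):
        groups.setdefault(find(word), set()).add(word)
    return list(groups.values())


if __name__ == "__main__":
    # Made-up dictionary entries with placeholder words, for illustration only.
    entries = {"词甲": "又称词乙，指某种汇率。", "词乙": "简称词丙。"}
    pairs = [
        (headword, synonym)
        for headword, definition in entries.items()
        for synonym in extract_synonyms(definition)
    ]
    print(pairs)
    print(synonym_groups(pairs))
```

The grouping step uses a union-find structure, which is one common way to compute the closure of a symmetric and transitive relation. The exclusive rules mentioned in Section 3 are not modelled here because the paper does not list them.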
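The PageRank update in Step 3 of Section 4 can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation. The function name `pagerank`, the node labels `w1` to `w8`, and the edge list are hypothetical. The value d = 0.15 is also an assumption: the paper only says that d lies between 0 and 1, but with d = 0.15 a word with no inbound links scores 1 - d = 0.85, which is the value Table 2 shows for its first three words. The edge list was chosen so that the first two update rounds agree with rows 1 and 2 of Table 2. It is not claimed to be the graph of Figure 2.

```python
def pagerank(edges, d=0.15, iterations=2):
    """Apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) for a fixed number of rounds.

    edges: (u, v) pairs meaning "u appears in the definition of v",
           so u links to v, as in Step 2.
    Returns the list of score dictionaries, starting with the initial scores.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    inbound = {node: [] for node in nodes}
    for source, target in edges:
        out_degree[source] += 1
        inbound[target].append(source)

    scores = {node: 1.0 for node in nodes}  # every word starts at 1
    history = [dict(scores)]
    for _ in range(iterations):
        # Each round reads only the scores of the previous round.
        scores = {
            node: (1 - d)
            + d * sum(scores[t] / out_degree[t] for t in inbound[node])
            for node in nodes
        }
        history.append(dict(scores))
    return history


# Hypothetical edge list (an assumption, see the lead-in).
EDGES = [
    ("w1", "w4"), ("w1", "w7"), ("w1", "w8"),
    ("w2", "w4"), ("w2", "w7"), ("w2", "w8"),
    ("w3", "w4"), ("w3", "w6"), ("w3", "w7"), ("w3", "w8"),
    ("w4", "w5"), ("w4", "w6"),
]

if __name__ == "__main__":
    for round_number, scores in enumerate(pagerank(EDGES)):
        print(round_number, scores)
```

Step 4 would then apply thresholds to these scores together with the other criteria the paper names (number of output synonyms and literal similarity). Those thresholds are not given in the paper and are left out of the sketch.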