A Proposed Tag Set for Exchanging Word-Segmented Text Corpora

Jing-Shin Chang
[email protected]
http://www.bdc.com.tw/~shin/
Behavior Design Corporation

Technical Issues in Designing a Tag Set for a Stratified Word Segmentation (WS) Standard
  Why a Stratified WS Standard?
    “Words” are generated by various mechanisms
    No common agreement on all WS criteria, because researchers and institutions model these mechanisms differently
    Stratification helps the exchange of corpora and their conversion to appropriate private word units
  What Tags and Attributes, and Why?

Text Generation Mechanisms Behind Word Stratification
  Lexicon Selection
    Basic Lexicon (“Standard Dictionary”)
    Derivational Processes (non-enumerable)
      simple variants (color/colour; 呆子/獃子; 兇手/凶手)
      regular expressions (numbers, word patterns)
      regular derivational processes (proper nouns, abbreviations, compounding, …)
  Text Planning
    Writing Variants (symbols, punctuation)

What Tags/Attributes, and Why?
  Tags for carrying linguistic information
    word boundary
    level of stratification (in terms of a standard)
    misc. (e.g., symbols and punctuation in text)
  Tags for carrying conformance information
    standard/substandard of conformance, so as
      to convert to and from private systems easily
      to allow user extension of (sub)standard(s) and overcome time-variance issues

Tags for carrying linguistic information
  Tags:
    <w0> (~信級詞): words in the standard dictionary
    <w1> (~達級詞): morphologically derived words
    <w2> (~雅級詞): words derived through compounding regularity
  Attributes: POS (part of speech), tt (token type, derived word type), hwds (embedded head words), rel (relationship among embedded head words)

Tags for carrying conformance information
  Tags: <wstxt>, <ws0p>, <ws1p>, <ws2p>, <p> (un-segmented para.)
    paragraphs of various stratification levels, conforming to the specified standard/substandards
  Attributes: WS, Dict, MR, NUM, NAM, CMPR, DR, GR, specifying:
    the “standard resources” conformed to
    user extensions, e.g., Dict="CNS-WS-Dict-1998-1,X-BDC-WS-Dict-1998-2"

Attributes on Standardized Resources: Why?
  The official standard (and thus the tags) should be defined in terms of explicitly specified and unambiguously testable resources,
    with an (optional) mechanism for user extension
    e.g., charset registries, RFCs (Internet standards)
  Every resource is assigned a symbolic name (referenced in an attribute)
    for conformance testing
    for conversion to, and evaluation in, private systems

Attributes on Standardized Resources (Cont.)
  WS: WS standard, the collection of a set of substandards (such as Dict, MR, ...)
    e.g., CNS-WS-1998-1 = CNS-WS-{Dict, MR, …}-1998-1
  Dict: standard dictionary (basic lexicon)
    qualified basic words
    POS: optional (referred to by other substandards)
  MR: morphological rules/standard
    qualified affixes/prefixes/suffixes
    qualified combination patterns

Attributes on Standardized Resources (Cont.) [& Arguable]
  NUM: numbering rules/patterns
  NAM: naming rules/patterns
    qualified family names
    length constraints, abbreviations, standard translations of foreign names, ...

Attributes on Standardized Resources (Cont.) [& Arguable]
  CMPR: compound formation rules/patterns
  DR: other derivational rules not in the above substandards
  GR: private rules/patterns/description

Example: Simplest Encoding
<!-- The whole segmented text is enclosed by the "wstxt" tag; the conforming standard is specified with attributes. Words are space-delimited and conform to the "w0", "w1" or "w2" standard depending on whether they are enclosed in "ws0p", "ws1p" or "ws2p" -->
<wstxt dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3">
<!-- use spaces as default word boundaries w/o using word tags -->
<ws0p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" verified=TRUE>
中文 分詞 標準 必須 一 步 一 步 小心 地 制定 .
</ws0p>
<ws1p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" verified=TRUE>
中文 分詞 標準 必須 一步一步 小心地 制定 .
</ws1p>
<ws2p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" CMPR="CNS-WS-CMPR-COMPOUND-DICT-1998-1.2" verified=TRUE>
中文分詞標準 必須 一步一步 小心地 制定 .
</ws2p>
</wstxt>

Example: Using Word Tags
<!-- use word tags to identify word boundaries -->
<ws0p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" verified=TRUE>
<w0>中文</w0><w0>分詞</w0><w0>標準</w0><w0>必須</w0>
<w0 pos=quan>一</w0><w0>步</w0><w0>一</w0><w0>步</w0>
<w0>小心</w0><w0>地</w0><w0>制定</w0><w0>.</w0>
</ws0p>
<ws1p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" verified=TRUE>
<w0>中文</w0> <w0>分詞</w0> <w0>標準</w0> <w0>必須</w0>
<w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>
<w1><w0>小心</w0><w0>地</w0></w1>
<w0>制定</w0> <w0>.</w0>
</ws1p>
<ws2p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" CMPR="CNS-WS-CMPR-COMPOUND-DICT-1998-1.2" verified=TRUE>
<w2><w0>中文</w0><w0>分詞</w0><w0>標準</w0></w2>
<w0>必須</w0>
<w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>
<w1><w0>小心</w0><w0>地</w0></w1>
<w0>制定</w0><w0>.</w0>
</ws2p>

Example: Derived Words and Token Type (tt) Attribute
<ws1p MR="CNS-WS-MR-1998-2.3" DR="CNS-WS-DR-1988.1.2">
<!-- examples of derived w1 words (from "w0" words) -->
<w1 tt=(common_noun,suffix) MR="CNS-WS-MR-1998-2.3">
<w0>孩子</w0><w0>們</w0>
</w1>
<w1>
<w0 pos=quan>一萬</w0> <w0>朵</w0> <w0 pos=quan>一萬</w0> <w0>朵</w0>
</w1><w0>地</w0><w0>送</w0>
</ws1p>

Example: Application of (hwds, rel) Attributes for Punctuation
<!-- examples of tagging words enclosed/delimited by punctuation -->
<w1 hwds="高中,高職" rel=AND_OR>
<w0>高中</w0><w0>(</w0><w0>職</w0><w0>)</w0>
</w1>
<!-- words with the same (hwds,rel) could be normalized to the same internal representation of a private system -->
<w1 hwds="中山南路,中山北路" rel=AND>
<w0>中山</w0><w0>南</w0><w0>、</w0><w0>北</w0><w0>路</w0>
</w1>
<w0>與</w0>
<w1 hwds="中山南路,中山北路" rel=AND>
<w0>中山</w0><w0>南</w0><w0>(</w0><w0>北</w0><w0>)</w0><w0>路</w0>
</w1>
<w0>意義</w0><w0>相同</w0><w1>...</w1><w1>...</w1>

Future Issues
  Specification of the Official WS Standard
    standard resources and substandards to be defined in the first official version
  Construction of Basic Lexicon
    basic vs. derivational words
    standardization of the derivational parts
  Registration of User Extension & Evolution of the Official Standard
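
Appendix sketch: converting between stratification levels

The word-tag examples nest <w0> words inside <w1> and <w2> words, which suggests that a paragraph tagged at the highest level can be flattened mechanically to any lower level. The following Python sketch illustrates that idea under stated assumptions: the sample is a well-formed XML rendering of the <ws2p> example (attribute values quoted, most attributes omitted), and the names SAMPLE, LEVEL and segment are hypothetical helpers introduced here, not part of the proposed standard.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed rendering of the <ws2p> paragraph from the
# "Using Word Tags" example (attribute values quoted for a strict parser).
SAMPLE = (
    '<ws2p dict="CNS-WS-DICT-1998-1" verified="TRUE">'
    '<w2><w0>中文</w0><w0>分詞</w0><w0>標準</w0></w2>'
    '<w0>必須</w0>'
    '<w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>'
    '<w1><w0>小心</w0><w0>地</w0></w1>'
    '<w0>制定</w0><w0>.</w0>'
    '</ws2p>'
)

# Stratification level carried by each word tag.
LEVEL = {"w0": 0, "w1": 1, "w2": 2}


def segment(elem, max_level):
    """Return the words under `elem`, split no coarser than `max_level`."""
    words = []
    for child in elem:
        level = LEVEL.get(child.tag)
        if level is None:
            continue  # skip anything that is not a w0/w1/w2 word tag
        if level <= max_level or len(child) == 0:
            # Keep the word whole: join the text of all embedded words.
            words.append("".join(child.itertext()))
        else:
            # Word is above the requested level: descend into its parts.
            words.extend(segment(child, max_level))
    return words


para = ET.fromstring(SAMPLE)
for lvl in (0, 1, 2):
    print(lvl, " ".join(segment(para, lvl)))
```

With this sample, the three printed lines should reproduce the space-delimited <ws0p>, <ws1p> and <ws2p> paragraphs of the "Simplest Encoding" example. A private system with its own word units could apply the same traversal with a different decision rule, for instance one based on the tt attribute.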