General Talk

A Proposed Tag Set for
Exchanging Word-Segmented
Text Corpora
Jing-Shin Chang
http://www.bdc.com.tw/~shin/
Behavior Design Corporation
Technical Issues in Designing a Tag
Set for a Stratified WS Standard

Why a Stratified WS Standard?
 - “Words” are generated by various mechanisms
 - No common agreement on all WS criteria, because different researchers and institutions use different processing models of these mechanisms
 - Stratification helps the exchange of corpora and their conversion to appropriate private word units

What Tags and Attributes, and Why?
Text Generation Mechanisms
Behind Word Stratification

Lexicon Selection
 - Basic Lexicon (“Standard Dictionary”)
 - Derivational Processes (non-enumerable)
   - simple variants (color/colour; 呆子/獃子; 兇手/凶手)
   - regular expressions (numbers, word patterns)
   - regular derivational processes (proper nouns, abbreviations, compounding, …)

Text Planning
 - Writing Variants (symbols, punctuation)
What Tags/Attributes, and Why?

Tags for carrying linguistic information
 - word boundary
 - level of stratification (in terms of a standard)
 - misc. (e.g., symbols and punctuation in text)

Tags for carrying conformance information
 - standard/substandard of conformance
   - so as to convert to and from private systems easily
   - to allow user extension of (sub)standard(s) and to overcome time-variant issues
Tags for carrying linguistic
information

Tags:
 - <w0> (~信級詞, "faithful"-level words): words in the standard dictionary
 - <w1> (~達級詞, "expressive"-level words): morphologically derived
 - <w2> (~雅級詞, "elegant"-level words): derived through compounding regularity

Attributes:
 - POS (part of speech), tt (token type, derived word type), hwds (embedded head words), rel (relationship among embedded head words)
Tags for carrying conformance
information

Tags: <wstxt>, <ws0p>, <ws1p>, <ws2p>, <p> (un-segmented paragraph)
 - paragraphs of various stratification levels, conforming to the specified standard/substandards

Attributes: WS, Dict, MR, NUM, NAM, CMPR, DR, GR, specifying:
 - the “standard resources” conformed to
 - user extensions, e.g., Dict="CNS-WS-Dict-1998-1,X-BDC-WS-Dict-1998-2"
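A private system reading such an attribute would probably need to separate the official resources from the user extensions. The following is a minimal Python sketch of that step. It assumes that multiple resources are comma-separated and that an "X-" prefix marks a user extension; both are inferred from the single example above and are not stated rules of the proposal. The function name `split_resources` is hypothetical.

```python
def split_resources(value):
    """Split a resource attribute value into (official, extensions).

    Assumption: names are comma-separated, and names starting with "X-"
    are user extensions (inferred from the slide's example only).
    """
    names = [name.strip() for name in value.split(",") if name.strip()]
    official = [name for name in names if not name.startswith("X-")]
    extensions = [name for name in names if name.startswith("X-")]
    return official, extensions


official, extensions = split_resources("CNS-WS-Dict-1998-1,X-BDC-WS-Dict-1998-2")
print("official:  ", official)
print("extensions:", extensions)
```

Keeping the two lists separate lets a receiver check the official names against a registry while treating extension names as opaque.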
Attributes On Standardized
Resources: Why?

Official standard (and thus tags) should be defined in terms of explicitly specified and unambiguously testable resources!
 - with an (optional) mechanism for user extension
 - e.g., Charset registry, RFC (Internet standards)

Every resource is assigned a symbolic name (referenced in attributes) for conformance testing
 - for conversion to/evaluation in private systems
Attributes On Standardized
Resources (Cont.)

WS: WS standard, the collection of a set of substandards (such as Dict, MR, ...)
 - e.g., CNS-WS-1998-1=CNS-WS-{Dict, MR, …}-1998-1

Dict: standard dictionary (basic lexicon)
 - qualified basic words
 - POS: optional (referred to by other substandards)

MR: morphological rules/standard
 - qualified affixes (prefixes/suffixes)
 - qualified combination patterns
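The WS entry above reads as a shorthand: one WS name stands for a family of substandard names. A minimal Python sketch of that expansion follows. It assumes the naming pattern `<prefix>-WS-<version>` expands to `<prefix>-WS-<substandard>-<version>`, as the single example suggests, and it expands only the two substandards the slide names explicitly (Dict, MR). The function name `expand_ws` is hypothetical.

```python
def expand_ws(ws_name, substandards=("Dict", "MR")):
    """Expand a WS standard name into per-substandard resource names.

    Assumption: the name has the shape "<prefix>-WS-<version>", as in
    "CNS-WS-1998-1"; the default substandards are only those listed
    explicitly on the slide.
    """
    prefix, sep, version = ws_name.partition("-WS-")
    if not sep or not prefix or not version:
        raise ValueError(f"not a WS standard name: {ws_name!r}")
    return {sub: f"{prefix}-WS-{sub}-{version}" for sub in substandards}


for attribute, resource in expand_ws("CNS-WS-1998-1").items():
    print(f'{attribute}="{resource}"')
```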
Attributes on Standardized
Resources (Cont.) [& Arguable]
NUM: numbering rules/patterns

NAM: naming rules/patterns
 - qualified family names
 - length constraints, abbreviations, standard translations of foreign names, ...
Attributes on Standardized
Resources (Cont.) [& Arguable]
CMPR: compound formation rules/patterns

DR: other derivational rules not in the above substandards

GR: private rules/patterns/description

Example: Simplest Encoding



<!-- The whole segmented text is enclosed by the "wstxt" tag; the conforming standard is specified with
attributes. Words are space-delimited and conform to the “w0”, “w1” or “w2” standard,
depending on whether they are enclosed in “ws0p”, “ws1p” or “ws2p” -->
<wstxt dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3">
 <!-- use spaces as default word boundaries w/o using word tags -->
 <ws0p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" verified=TRUE>

中文 分詞 標準 必須 一 步 一 步 小心 地 制定 .
 </ws0p>
 <ws1p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" verified=TRUE>

中文 分詞 標準 必須 一步一步 小心地 制定 .
 </ws1p>
 <ws2p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" CMPR="CNS-WS-CMPR-COMPOUND-DICT-1998-1.2" verified=TRUE>

中文分詞標準 必須 一步一步 小心地 制定 .
 </ws2p>
</wstxt>
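As a rough illustration of how a receiver might read the space-delimited encoding, here is a minimal Python sketch. It uses the standard library's `html.parser.HTMLParser` as a lenient stand-in for an SGML parser, which is an assumption of this sketch and not part of the proposal. The class name `SimpleWSReader` is hypothetical, and the sample is a shortened copy of the example above.

```python
from html.parser import HTMLParser


class SimpleWSReader(HTMLParser):
    """Collect (level, attributes, words) for each ws0p/ws1p/ws2p paragraph."""

    LEVELS = {"ws0p": 0, "ws1p": 1, "ws2p": 2}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # paragraph being read, or None outside one

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names (NUM becomes num).
        if tag in self.LEVELS:
            self._current = (self.LEVELS[tag], dict(attrs), [])

    def handle_data(self, data):
        # Spaces are the default word boundaries in this encoding.
        if self._current is not None:
            self._current[2].extend(data.split())

    def handle_endtag(self, tag):
        if tag in self.LEVELS and self._current is not None:
            self.paragraphs.append(self._current)
            self._current = None


sample = """
<wstxt dict="CNS-WS-DICT-1998-1">
 <ws0p verified=TRUE> 中文 分詞 標準 必須 一 步 一 步 小心 地 制定 . </ws0p>
 <ws1p verified=TRUE> 中文 分詞 標準 必須 一步一步 小心地 制定 . </ws1p>
</wstxt>
"""

reader = SimpleWSReader()
reader.feed(sample)
reader.close()
for level, attrs, words in reader.paragraphs:
    print(level, attrs.get("verified"), words)
```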
Example: Using Word Tags







<!-- use word tags to identify word boundaries -->
<ws0p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" verified=TRUE>

<w0>中文</w0><w0>分詞</w0><w0>標準</w0><w0>必須</w0>

<w0 pos=quan>一</w0><w0>步</w0><w0>一</w0><w0>步</w0>

<w0>小心</w0><w0>地</w0><w0>制定</w0><w0>.</w0>
</ws0p>
<ws1p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3"
verified=TRUE>

<w0>中文</w0> <w0>分詞</w0> <w0>標準</w0> <w0>必須</w0>

<w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>

<w1><w0>小心</w0><w0>地</w0></w1> <w0>制定</w0> <w0>.</w0>
</ws1p>
<ws2p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3"
CMPR="CNS-WS-CMPR-COMPOUND-DICT-1998-1.2" verified=TRUE>

<w2><w0>中文</w0><w0>分詞</w0><w0>標準</w0></w2>

<w0>必須</w0>

<w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>

<w1><w0>小心</w0><w0>地</w0></w1>

<w0>制定</w0><w0>.</w0>
</ws2p>
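Because the word tags nest, a text tagged at the w2 level already contains its w1 and w0 segmentations. The following minimal Python sketch shows one way a private system might flatten such markup to a chosen level: the outermost word tag whose level does not exceed the target forms one word, and higher-level tags are treated as transparent. This conversion rule is an interpretation of the example above, not a rule stated in the proposal. `html.parser.HTMLParser` again stands in for an SGML parser, and the names `LevelFlattener` and `segment` are hypothetical.

```python
from html.parser import HTMLParser


class LevelFlattener(HTMLParser):
    """Re-segment nested <w0>/<w1>/<w2> markup at a target stratification level."""

    WORD_TAGS = {"w0": 0, "w1": 1, "w2": 2}

    def __init__(self, target_level):
        super().__init__()
        self.target_level = target_level
        self.words = []
        self._depth = 0    # nesting depth inside the word being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        level = self.WORD_TAGS.get(tag)
        if level is None:
            return  # paragraph tags and others are ignored
        if self._depth > 0:
            self._depth += 1  # nested tag inside the current word
        elif level <= self.target_level:
            self._depth = 1   # outermost tag at or below the target level
            self._buffer = []
        # tags above the target level are transparent

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data.strip())

    def handle_endtag(self, tag):
        if tag in self.WORD_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.words.append("".join(self._buffer))


def segment(markup, target_level):
    parser = LevelFlattener(target_level)
    parser.feed(markup)
    parser.close()
    return parser.words


sample = (
    "<ws2p>"
    "<w2><w0>中文</w0><w0>分詞</w0><w0>標準</w0></w2>"
    "<w0>必須</w0>"
    "<w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>"
    "<w1><w0>小心</w0><w0>地</w0></w1>"
    "<w0>制定</w0><w0>.</w0>"
    "</ws2p>"
)

for level in (0, 1, 2):
    print(level, " ".join(segment(sample, level)))
```

For this sample, the three levels reproduce the three space-delimited paragraphs of the simplest encoding.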
Example: Derived Words and
Token Type (TT) Attribute



<ws1p MR="CNS-WS-MR-1998-2.3" DR="CNS-WS-DR-1988.1.2">
<!-- examples of derived w1 words (from "w0" words) -->
 <w1 tt=(common_noun,suffix) MR="CNS-WS-MR-1998-2.3">
 <w0>孩子</w0><w0>們</w0>
 </w1>
 <w1>
 <w0 pos=quan>一萬</w0>
 <w0>朵</w0>
 <w0 pos=quan>一萬</w0>
 <w0>朵</w0>
 </w1><w0>地</w0><w0>送</w0>
</ws1p>
Example: Application of (hwds,
rel) Attributes for Punctuation










<!-- examples of tagging punctuation enclosed/delimited words -->
<w1 hwds="高中,高職" rel=AND_OR>
 <w0>高中</w0><w0>(</w0><w0>職</w0><w0>)</w0>
</w1>
<!-- words with the same (hwds,rel) could be normalized to the same internal
representation of a private system -->
<w1 hwds="中山南路,中山北路" rel=AND>
 <w0>中山</w0><w0>南</w0><w0>、</w0><w0>北</w0><w0>路</w0>
</w1>
<w0>與</w0>
<w1 hwds="中山南路,中山北路" rel=AND>
 <w0>中山</w0><w0>南</w0><w0>(</w0><w0>北</w0><w0>)</w0><w0>路</w0>
</w1>
<w0>意義</w0><w0>相同</w0><w1>...</w1><w1>...</w1>
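The normalization mentioned in the comment above can be sketched in a few lines of Python. The sketch assumes that head words in `hwds` are comma-separated, as in the example, and that a private system uses the pair (head words, rel) as its internal key; the key shape and the class name `HeadWordCollector` are hypothetical. `html.parser.HTMLParser` is used as a lenient stand-in for an SGML parser.

```python
from html.parser import HTMLParser


class HeadWordCollector(HTMLParser):
    """Collect a (head words, rel) key for every tag carrying an hwds attribute."""

    def __init__(self):
        super().__init__()
        self.keys = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "hwds" in attributes:
            # Assumption: head words are comma-separated, as in the slide.
            heads = tuple(h.strip() for h in attributes["hwds"].split(","))
            self.keys.append((heads, attributes.get("rel")))


sample = (
    '<w1 hwds="中山南路,中山北路" rel=AND>'
    "<w0>中山</w0><w0>南</w0><w0>、</w0><w0>北</w0><w0>路</w0>"
    "</w1>"
    '<w1 hwds="中山南路,中山北路" rel=AND>'
    "<w0>中山</w0><w0>南</w0><w0>(</w0><w0>北</w0><w0>)</w0><w0>路</w0>"
    "</w1>"
)

collector = HeadWordCollector()
collector.feed(sample)
collector.close()
for key in collector.keys:
    print(key)
# Both surface forms yield the same key, so they can share one internal form.
print(collector.keys[0] == collector.keys[1])
```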
Future Issues

Specification of the Official WS Standard
 - standard resources and substandards to be defined in the first official version

Construction of Basic Lexicon
 - basic vs. derivational words
 - standardization of the derivational parts

Registration of User Extension & Evolution of the Official Standard