© ISO 2005 – All rights reserved
ISO/TC 37/SC 4 AWI N309
Date: 2006-08-18
ISO/AWI N309
ISO/TC 37/SC 4/WG 2
Secretariat: Key-Sun, Choi
Language resource management - Word segmentation of written texts for
mono-lingual and multi-lingual information processing - Part 1:
General principles and methods
Warning
This document is not an ISO International Standard. It is distributed for review and comment. It is subject to
change without notice and may not be referred to as an International Standard.
Recipients of this draft are invited to submit, with their comments, notification of any relevant patent rights of
which they are aware and to provide supporting documentation.
Document type: International standard
Document subtype: if applicable
Document stage: (9) Preparation
Document language: E
Copyright notice
This ISO document is a working draft or committee draft and is copyright-protected by
ISO. While the reproduction of working drafts or committee drafts in any form for use by
participants in the ISO standards development process is permitted without prior
permission from ISO, neither this document nor any extract from it may be reproduced,
stored or transmitted in any form for any other purpose without prior written permission
from ISO.
Requests for permission to reproduce this document for the purpose of selling it should
be addressed as shown below or to ISO’s member body in the country of the requester:
[Indicate :
the full address
telephone number
fax number
telex number
and electronic mail address
as appropriate, of the Copyright Manager of the ISO member body responsible for the
secretariat of the TC or SC within the framework of which the draft has been prepared]
Reproduction for sales purposes may be subject to royalty payments or a licensing
agreement.
Violators may be prosecuted.
1
Scope
The word segmentation International Standard series (Part 1, Part 2 and Part 3) targets any
natural language in which the word boundaries of written text cannot be fully identified by
typographic properties (such as spaces in English), for example Chinese, Japanese, Korean,
Thai, Vietnamese, Mongolian, and Tibetan.
These Standards concern what the output should be for any input text after the process of
word segmentation. They pursue consistency in word segmentation within and among texts to
the maximum extent, so as to meet the requirements of a variety of applications in language
information processing, both mono-lingual and multi-lingual. The applications include, but are
not limited to, natural language processing, information retrieval, search engines,
question-answering, machine translation and machine-aided translation, pre-processing for
text-to-speech, post-processing for speech recognition, OCR and other character input
methods, proofreading, digital libraries, terminology and ontology, the semantic web, eBusiness
and eCommerce, content management, and natural-language-based computer-aided
eLearning (including language learning and second language learning). They shall also be
helpful for orthographic processing (Romanization) of text in some languages such as
Chinese.
These Standards do not cover word segmentation algorithms, although all the factors
considered here, the lexicon for example, are necessary for algorithm design and
implementation.
The Standard presented here is Part 1 in the word segmentation standard series, with
emphasis on the general principles and methods in word segmentation.
The Standard should be used in close conjunction with ISO 16642:2003, Terminology Markup
Framework, with ISO 12620, Terminology and other language resources ― Data categories
for electronic lexical resources (DCR), with ISO WD 24613:2004, Language resource
management—Lexical markup framework (LMF), and with ISO WD *** Morphosyntactic
Annotation Framework, particularly in the representation of lexical items and word
segmentation output.
2
Normative references
The following normative documents contain provisions of this Standard. In general, the
definitions of related concepts given in these documents apply in this Standard, though the
definitions given here may take priority where there is some degree of inconsistency between
the normative documents and this Standard.
ISO 639-1:2002, Codes for the representation of names of languages – Part 1: Alpha-2 Code.
ISO 639-2:1998, Code for the representation of languages – Part 2: Alpha-3 Code.
ISO 639-3:200?, Codes for the representation of languages – Part 3: Alpha-3 Code for the
comprehensive coverage of languages
ISO 704:2000, Terminology work – Principles and methods
ISO 860:1996, Terminology work – Harmonization of concepts and terms
ISO 1087-1:2000, Terminology – Vocabulary – Part 1: Theory and application
ISO 1087-2:1999, Terminology – Vocabulary – Part 2: Computer application.
ISO/IEC 10646-1:2003, Information technology – Universal
Multiple-Octet Coded Character Set (UCS)
ISO/IEC 11179-3:2003, Information Technology – Data management and interchange –
Metadata Registries (MDR) – Part 3: Registry Metamodel (MDR3)
ISO 12620:1999, Computer applications in terminology – Data categories
ISO 16642:2003, Computer applications in terminology – TMF (Terminological Markup
Framework)
3
Terms and definitions
For the purpose of clarity, two sets of terms are defined in this Standard: core terms and
peripheral terms. The core terms are necessary for this Standard, while the peripheral terms
are not necessary but are closely related to the context of this Standard.
3.1 Core terms
3.1.1 Morphology
The study of the structure and formation of words. In general, there are two sub-types of
morphology, lexical morphology and inflectional morphology.
3.1.2 Word
A basic grammatical unit, and a relatively independent carrier of meaning, of language that can
stand alone to make up sentences. The unit is basically at the morphological level of a
language and is intuitively and mentally available for native speakers. A word is abstracted to a
lexeme in the lexicon, with at least a part of speech. A word may consist of a single morpheme
or of a combination of morphemes.
3.1.3 Lexeme
A basic abstract unit of the lexicon which is concretely realized in word forms. A lexeme may
also be a part of another lexeme in terms of word formation, such as derivation and
compounding. Free morphemes are the simplest lexemes. In this Standard, lexeme is used
synonymously for word. The citation form of a lexeme, or lemma, is that word form belonging
to the lexeme which is conventionally chosen to represent the lexeme.
EXAMPLE: [English] find, found, and finding are word forms of the lexeme FIND (in this
document, lexemes are generally distinguished by the use of capital letters), and find is the
citation form, or lemma, for FIND. [Japanese] ******. [Korean] ******.
3.1.4 Lexicon
The totality of all the established words (or more precisely, lexical items) of a language, seen
either as a list or as a structured whole, with information given for each lexical item, usually
including pronunciation, meaning, morphological properties and syntactic properties. The
lexicon is an abstract linguistic entity. It can be organized as a dictionary, usually an electronic
dictionary in computers (either machine readable or machine tractable), or as the mental
lexicon in a native speaker’s mind.
3.1.5 Dictionary
A concrete realization of (a part of) the lexicon, usually an electronic dictionary in the context of
this Standard. The collection of lexical items in a dictionary is called its word-list.
NOTE: A dictionary in general can never provide a full coverage of the lexicon due to the
practical limitations of size and requirements of language users, as well as the morphological
complexity in the language.
3.1.6 Mental lexicon
The mental representation of lexical knowledge in the brain of the individual language user.
NOTE: The mental lexicon of an individual is always smaller than the lexicon.
3.1.7 Lemma (also called Lexical item)
In this Standard, lemmas refer to all the lexical items listed in the dictionary, including not only
words strictly defined in linguistics, but also fixed expressions (phrasal words, idioms, technical
terms), and possibly proverbs and familiar quotations. In the latter case, the lemma for a
lexical item is derived from the lemmas of its constituent elements, if applicable.
NOTE: The lemma for (He) kicked the bucket (in the accident) is kick the bucket.
3.1.8 Word forms
The concretely realized grammatical forms of a word, or equivalently, of a lexeme, according
to its grammatical categories in the context of a sentence. Word forms in the spoken language
can be transcribed. Word forms in the written language can be termed orthographic words.
The concept of word forms can be extended to any lexical item in the way that any linguistically
validated word-forms combination of its constituent elements is regarded as a word form of
that lexical item.
NOTE: The concept of word forms plays a key role in inflecting languages and agglutinating
languages, like English and Japanese, but is in general not useful for isolating languages,
like Chinese and Vietnamese.
3.1.9 Morpheme
The smallest meaningful element of language that, as a basic phonological and semantic
element at a level lower than the word, cannot be reduced into smaller elements. Or briefly, a
grammatical unit which is used to constitute words. The morpheme is an abstract unit. It is
represented phonetically and phonologically by, or realized in, morphs. Morphemes can be
classified into two sub-types, free morpheme and bound morpheme.
3.1.10 Free morpheme
Morphemes that can stand by themselves.
EXAMPLE: [Chinese] 猪(pig), a single-character word, is a free morpheme. [Japanese] ******.
[Korean] ******.
3.1.11 Bound morpheme
Morphemes that appear only together with other morphemes to form a lexeme.
EXAMPLE: [Chinese] 伟 is a bound morpheme: as a character it means great, but it
cannot function as a word in text. Instead, it is used as a constituent part of many words,
such as 伟大(great), 伟人(giant) and 雄伟(majesty). [Japanese] ******. [Korean] ******.
3.1.12 Morph
The constituent element of a word form. It is a concrete realization of a morpheme.
3.1.13 Realization
In the sense used here, ‘realize’ means ‘to make real’. Abstract entities are realized by entities
which have a form. So word forms realize lexemes, morphs realize morphemes, and a
dictionary realizes a lexicon.
3.1.14 Lemmatization
The process of determining the lemma for a given word form in the context of a sentence,
usually accompanied by determining the part of speech of that word.
EXAMPLE: [English] The lemma for finding is determined as find by lemmatization. [Japanese]
******. [Korean] ******.
NOTE: Lemmatization in general does not make sense for the isolating languages, such as
Chinese.
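As an illustration only, lemmatization for the English example above can be sketched as a lookup keyed on the word form and its part of speech. The table WORD_FORMS, the function lemmatize and the part-of-speech labels are hypothetical names chosen for this sketch and are not defined by this Standard.

```python
from typing import Optional

# Hypothetical toy table: (word form, part-of-speech label) -> lemma.
# The labels VB/VBD/VBG are assumptions made for this sketch.
WORD_FORMS = {
    ("find", "VB"): "find",
    ("finding", "VBG"): "find",
    ("found", "VBD"): "find",   # past tense of the lexeme FIND
    ("found", "VB"): "found",   # base form of the lexeme FOUND ("to establish")
}


def lemmatize(word_form: str, pos: str) -> Optional[str]:
    """Return the lemma for a word form with the given part of speech.

    Returns None when the pair is not listed in the toy table.
    """
    return WORD_FORMS.get((word_form.lower(), pos))
```

The same word form found maps to different lemmas depending on its part of speech, which is why lemmatization is described as operating in the context of a sentence.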
3.1.15 Root
A root is that part of a word form which remains when all inflectional and derivational affixes
have been removed. It is a basic part of a lexeme and cannot be further analyzed into smaller
morphs.
EXAMPLE: [English] The root in the word form destabilized is stabil-, derived from removing
the derivational affixes de- and -ize, as well as the inflectional suffix -(e)d. [Japanese] ******.
[Korean] ******.
3.1.16 Stem
A stem is that part of a word form which remains when all inflectional affixes have been
removed. A stem consists minimally of a root, but may be analyzable into one or many roots,
together with the associated derivational affixes. If a stem does not occur by itself in a
meaningful way in a language, it is referred to as a bound morpheme.
EXAMPLE: [English] The stem for the word form destabilized is destabilize, which includes the
root stabil-, the derivational affixes de- and -ize, but not the inflectional suffix -(e)d.
[Japanese] ******. [Korean] ******.
NOTE: In English, the lemma for a word form always uses its stem. The difference is that
lemmatization of isn’t leads to two lemmas, is and not, whereas no operation concerning
stems is applicable to isn’t.
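The relation between root, stem and word form in the destabilized example can be sketched as follows. The analysis is written by hand to match the EXAMPLE text; the class Morph and the functions stem and root are hypothetical names, and no automatic morphological analysis is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morph:
    form: str
    kind: str  # "root", "derivational" or "inflectional"


# Hand-written analysis of the word form "destabilized".
DESTABILIZED = [
    Morph("de", "derivational"),
    Morph("stabil", "root"),
    Morph("ize", "derivational"),
    Morph("d", "inflectional"),
]


def stem(morphs):
    """Drop inflectional affixes only; roots and derivational affixes remain."""
    return "".join(m.form for m in morphs if m.kind != "inflectional")


def root(morphs):
    """Drop all affixes; a word with several roots yields them concatenated."""
    return "".join(m.form for m in morphs if m.kind == "root")
```

Under these assumptions, stem(DESTABILIZED) gives destabilize and root(DESTABILIZED) gives stabil, matching the EXAMPLE entries for 3.1.15 and 3.1.16.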
3.1.17 Part of speech (also called Lexical category, or Word class)
In grammar, the part of speech of a word (or more precisely lexical item) is defined as the role
that the word plays in a sentence, by its particular syntactic or morphological behaviors.
3.1.18 Lexical morphology
A branch of morphology that deals with word formation.
3.1.19 Inflectional morphology
A branch of morphology that deals with inflection.
3.1.20 Word formation
The creation/building of words in a language. The greatest part of word-formation can be
subsumed under the processes of derivation, compounding, abbreviation and borrowing.
3.1.21 Inflection
The process of adding inflectional affixes to a lexeme to create its word forms. Inflection is a
grammatical process, rather than a lexical process.
3.1.22 Affix
A bound morpheme which does not realize a lexeme and may be added to a stem.
Functionally, affixes can be of two categories, inflectional or derivational. Affixes can also be
classified into three main sub-types according to their placement on the stem: prefix, suffix and
infix.
3.1.23 Inflectional affix
Affixes that can produce the word forms of a lexeme.
3.1.24 Derivational affix
Affixes that can produce a new lexeme from an existing lexeme.
3.1.25 Affixation
A process in which an affix is added to a stem.
3.1.26 Prefix
A sub-type of affix that is placed before the stem.
3.1.27 Prefixation
A process of affixation in which a prefix is added to the beginning of a stem.
3.1.28 Suffix
A sub-type of affix that is placed after the stem.
3.1.29 Suffixation
A process of affixation in which a suffix is added to the end of a stem.
3.1.30 Infix
A sub-type of affix that is inserted into a stem.
3.1.31 Infixation
A process of affixation in which an infix is inserted into a stem.
3.1.32 Derivation
A process of word formation in which a derivational affix is added to a lexeme to create a new
lexeme. Derivation is a lexical process.
3.1.33 Conversion (also called Zero derivation)
A process of word formation in which a word is created from an existing word without any
change in form, but often with change in part of speech.
3.1.34 Reduplication
A morphological process in which the entire word, or part of it, is repeated. Reduplication is
used both in inflections to convey a grammatical function, and in lexical derivation to create
new words. Reduplication position may be initial, final, or internal. It can be in some cases
viewed as a special way of making affixes, both inflectional and derivational.
3.1.35 Compounding (also called Composition)
A process of word formation in which new lexemes are formed by adjoining two or more
lexemes. Compounding is a lexical process.
NOTE: Compounding should not be confused with derivation, where bound morphemes are
added to free ones.
3.1.36 Compound
The result of the process of compounding. A compound may be endocentric if it has a head, i.e.
the fundamental part that contains the basic meaning of the whole compound, and modifiers,
which restrict this meaning, or exocentric if it does not have a head. And, a compound can be
rather long. There are two main sub-types of compounds according to their degree of
lexicalization, word-compound and phrasal compound.
3.1.37 Word-compound
A compound whose overall meaning is often not predictable from its constituent parts.
Word-compound is a kind of word strictly defined in linguistics.
3.1.38 Phrasal compound
A compound that is used steadily and frequently in the language, even if its overall meaning is
predictable from its constituent parts (it might thus be thought of as a phrase by some
linguists).
EXAMPLE: [English] Apple pie. [Chinese] 猪肉(pork) is composed of two single-character
words, 猪(pig) and 肉(meat), and is thus a phrasal compound. [Japanese] ******. [Korean] ******.
NOTE: In practice, there is no clear-cut boundary between word-compound and phrasal
compound, nor between phrasal compound and phrase, owing to the fuzziness of semantic
predictability and of the degree of lexicalization, although in theory the boundaries can be
very clear. Lexico-statistics, word frequency in particular, plays an important role in this
respect.
3.1.39 Multi-word expression
An expression composed of an ordered group of words which can stand independently and is
used steadily and frequently in the language. Multi-word expressions include compounds (both
word-compounds and phrasal compounds), fixed expressions and technical terms.
3.1.40 Fixed expression
A multi-word expression whose constituent elements cannot be moved randomly or
substituted without distorting the overall meaning or allowing a literal interpretation. Fixed
expressions range from word-compounds and collocations to idioms. Some proverbs and even
familiar quotations can also be considered fixed expressions if they are used steadily and
frequently in the language.
3.1.41 Idiom
A fixed expression whose overall meaning is not always transparent from combination of the
meaning of its constituent elements. Frequently for an idiom, there is a diachronic connection
between the literal reading and the idiomatic reading. Some of these connections can only be
reconstructed through historical knowledge.
3.1.42 Colloquial expression
A sub-type of idiom which is normally used in the spoken language and considered more
informal than written discourse.
3.1.43 Collocation
A fixed expression concerning the semantic compatibility of two or more grammatically
adjacent items (words, in most cases), in which an idiomatic relation has been developed in
some degree, as evidenced by a habitual co-occurrence between these items. The
co-occurrence patterns may be syntactic or morphological. Collocations which are adjacent in
text can be treated as lexical items. Collocations of this kind are more fixed than free word
combinations and less fixed than idioms.
3.1.44 Technical term
Terms used in a specific subject or domain. Technical terms of more than two words are
generally viewed as multi-word expressions.
3.1.45 Abbreviation
A process and result of word formation in which a shortened form of a word, phrase or term
which represents its full form is created by omitting words or letters/characters from the full
form, for the sake of brevity. Abbreviation is a lexical process.
3.1.46 Borrowing
A process of word formation in which a linguistic expression is borrowed from one language
to another language, usually when no term exists for the new object or concept. Among the
causes of such cross-linguistic influence may be various political, cultural, social, or economic
developments.
3.1.47 Loan word
Words borrowed from one language into another language, which have become lexicalized (or,
assimilated phonetically, graphemically, or grammatically) into the new language as the result
of borrowing.
3.1.48 Proper noun (also called Proper name)
The names of unique entities, for example, persons, places, or organizations.
3.1.49 Word structure (also called Morphological structure of word)
The structure obtained by applying morphological analysis to a word (more precisely, a lexical
item) in the lexicon, or to a word form in text. Some lexical items can be analyzed into
structures whereas some others are not analyzable, according to morphology of the language.
3.1.50 Word segmentation
A process that performs morphological analysis on any input text of a language within the
scope of this Standard, with the identified word boundaries as its main output. It serves as the
first step in all related NLP-oriented information processing systems for these languages.
3.1.51 Word segmentation unit (WSU)
A computation-oriented morphological unit that includes: (1) all the lexical items in the lexicon;
and (2) all the word forms, numeric strings, foreign character strings, word components (e.g.,
bound morphemes), and miscellaneous items that may appear in text.
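One way to picture the kinds of word segmentation unit listed above is a small classifier over already-identified units. This is a rough sketch for Chinese text: the function name classify_wsu, the category labels, and the tests used for numeric and foreign strings (ASCII digits and ASCII letters) are all assumptions, and the word-form category is omitted because the sketch has no morphological analysis.

```python
def classify_wsu(unit, lexicon, bound_morphemes):
    """Assign a unit to one of the WSU kinds named in 3.1.51 (simplified)."""
    if unit in lexicon:
        return "lexical item"
    if unit in bound_morphemes:
        return "word component"
    if unit.isdigit():
        return "numeric string"
    # Assumption: in Chinese text, a run of ASCII letters is a foreign string.
    if unit.isascii() and unit.isalpha():
        return "foreign character string"
    return "miscellaneous"


# Toy resources built from examples used elsewhere in this Standard.
LEXICON = {"猪", "肉", "猪肉", "伟大"}
BOUND_MORPHEMES = {"伟"}
```

For instance, classify_wsu("伟", LEXICON, BOUND_MORPHEMES) falls into the word component kind, while a punctuation mark falls into the miscellaneous kind.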
3.1.52 Corpus
A collection of texts concerning actual language use, usually electronically stored and
processed.
3.1.53 Representative corpus of a language
A large enough and well balanced corpus which is appropriate for depicting the whole picture
of the language use.
3.1.54 Domain specific corpus
A corpus of a specific subject or domain which can reflect the language use in that subject or
domain.
3.1.55 Raw corpus
A corpus without any linguistic processing or annotation.
3.1.56 Annotated corpus
A corpus with high quality linguistic annotation at a certain linguistic level. An annotated corpus
at the level of word segmentation is an important resource for this Standard.
3.1.57 Type
Linguistic unit in a text or corpus representing a defined class.
3.1.58 Token
Occurrence of a type in a text or corpus.
NOTE: If the class is defined as all word forms of a lexeme, then the linguistic unit in this
setting is called a word type, and all the occurrences of the word forms are called word tokens
of this word type.
3.1.59 Lexico-statistics
A variety of statistics that are helpful for the quantitative study on morphology of a language,
such as frequency, mutual information, and Chi-Square.
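As a sketch of one such statistic, the function below computes pointwise mutual information (one common variant of mutual information, chosen here as an assumption) for an adjacent pair of units in a segmented corpus. The function name pmi and the maximum-likelihood estimates are illustrative choices, not requirements of this Standard.

```python
import math
from collections import Counter


def pmi(sentences, x, y):
    """Base-2 pointwise mutual information of the adjacent pair (x, y).

    sentences is a list of lists of already-segmented units. Probabilities
    are plain relative frequencies; no smoothing is applied.
    """
    unigrams = Counter()
    bigrams = Counter()
    for units in sentences:
        unigrams.update(units)
        bigrams.update(zip(units, units[1:]))
    if bigrams[(x, y)] == 0:
        return float("-inf")  # the pair never occurs adjacently
    p_xy = bigrams[(x, y)] / sum(bigrams.values())
    p_x = unigrams[x] / sum(unigrams.values())
    p_y = unigrams[y] / sum(unigrams.values())
    return math.log2(p_xy / (p_x * p_y))
```

A higher value suggests that the two units co-occur more often than chance would predict, which is one kind of evidence used when judging the degree of lexicalization.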
3.1.60 Frequency
The number of occurrences of a type in a text or corpus. Frequencies are established by means
of frequency counts on the basis of a corpus. In general, adequate estimation of frequencies
can be derived directly from the representative corpus of the language. Improved estimation
might be obtained by considering a variety of related factors, for example, frequency
distribution on domain specific corpora, different genres, and different periods of time.
Evidence of this kind can help in the selection of the word-list, particularly in quantifying the
degree of lexicalization.
NOTE: If the type is defined as word type, then what we get is called word frequency. The
corpus for this purpose should be annotated in advance at the word segmentation level.
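The relation between word types, word tokens and word frequency can be sketched with a short count over a segmented text. The mapping LEXEME_OF and the function word_frequencies are hypothetical; a real system would obtain the lexeme of each word form from the lexicon.

```python
from collections import Counter

# Hypothetical mapping from word forms to lexemes (written in capitals, as in 3.1.3).
LEXEME_OF = {
    "find": "FIND", "found": "FIND", "finding": "FIND",
    "pie": "PIE", "pies": "PIE",
}


def word_frequencies(segmented_text):
    """Count word tokens per word type, where a word type is a lexeme.

    Word forms missing from LEXEME_OF are treated as their own type.
    """
    return Counter(LEXEME_OF.get(form, form.upper()) for form in segmented_text)


text = ["find", "pies", "found", "pie", "finding"]
freq = word_frequencies(text)
```

In this toy text there are five word tokens but only two word types, FIND and PIE, with word frequencies 3 and 2.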
3.1.61 Lexicalization
The process of making a word to express a concept. A possible word is said to be lexicalized if
it has become an established word. Possible words may be lexicalized if their meaning is no
longer the sum of the meanings of their parts, or if they are unproductive in formation. They
may also be lexicalized in other ways, for example as phrasal compounds, even with a quite
productive formation or a lack of semantic idiosyncrasy. In the above cases, the degree of
lexicalization varies from high to low, with full semantic idiosyncrasy at one extreme (high) and
phrasal compounds at the other extreme (low).
EXAMPLE: [English] The degree of lexicalization for honeymoon is high, and that for apple
pie is low.
3.1.62 Homograph
A type of lexical ambiguity. Two lexical items are homographic if they are orthographically
identical but have different meanings.
3.2 Peripheral terms
3.2.1 Grapheme
The smallest distinctive unit in a writing system of a language. There are two major types of
writing systems: non-phonological systems (pictographic, ideographic, cuneiform, and
logographic), and phonological systems (syllabic, alphabetic). A grapheme often represents a
morpheme or a whole word in non-phonological systems, whereas it represents a phoneme or
a syllable in phonological systems. Variants of any given grapheme are called allographs.
NOTE: In non-phonological systems (e.g. Chinese, Japanese Kanji) and syllabic phonological
systems (e.g. Japanese kana, Korean Hangul), a grapheme is conventionally called a character,
or a character component in particular cases (e.g. Korean Hangul), whereas in alphabetic
phonological systems (e.g. English), a grapheme is conventionally called a letter. The number
of graphemes in phonological systems usually ranges from 20-30 to several dozen, and that in
non-phonological systems is usually several thousand or more.
3.2.2 Character
A character is any grapheme (including the so-called letter and character in writing systems),
number, space, punctuation mark, or other symbol that can be processed in a computer. The
list of characters, or character set, is defined by ISO/IEC 10646.
3.2.3 Syllable
Basic phonetic-phonological unit of the word or of speech that can be identified intuitively.
NOTE: Most Chinese characters are monosyllabic and monomorphemic.
3.2.4 Delimiter
One or more characters used to indicate the beginning or the end of a character string.
3.2.5 Stemming
A computational procedure that determines a stem of a given word form. It is usually sufficient
that related words are mapped to the same stem (in fact, the same root instead of a stem in
some cases), even if the stem may not be fully linguistically valid.
NOTE: Stemming can be regarded as an approximation of lemmatization. Stemming operates
on a single word without knowledge of the context, so its accuracy is not ideal, but it is
easier to implement and runs faster, and the reduced accuracy may not matter for some
applications, such as information retrieval.
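A minimal sketch of context-free stemming by suffix stripping is given below. The suffix list and the function naive_stem are assumptions made for illustration; this is not a published stemming algorithm and is not prescribed by this Standard.

```python
# Hypothetical English suffix list, ordered longest first so that the
# longest matching suffix is stripped.
SUFFIXES = ("ization", "izing", "ized", "izes", "ize", "ing", "ed", "es", "s")


def naive_stem(word_form: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word_form.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word
```

With this list, destabilize, destabilized and destabilizing all map to destabil, which is sufficient for grouping related words although it is not the linguistically valid stem destabilize. The irregular form found is left unchanged and is not grouped with find, which is the kind of error that lemmatization avoids.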
3.2.6 Tokenization
An operation of splitting up a string of characters into a sequence of tokens, according to how
tokens are defined. It is applicable to text in natural language as well as text in artificial
language. If the object is natural language text that requires word segmentation, and tokens
are defined as word forms, then the task of tokenization in this setting can be thought of as
almost the same as the task of word segmentation.
NOTE: Tokenization is applicable to any language, English for example, not just to languages
that need word segmentation, though the task of tokenization may vary from case to
case.
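To make the contrast concrete, here is a small sketch of typographic tokenization using a regular expression. The pattern and the function tokenize are assumptions for illustration: tokens are defined as runs of ASCII letters (optionally with an apostrophe part), decimal numbers, or any other single non-space character.

```python
import re

# Assumed token definition for this sketch only.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?|\S")


def tokenize(text: str):
    """Split text into tokens according to TOKEN_PATTERN; whitespace is dropped."""
    return TOKEN_PATTERN.findall(text)
```

For English text such as "It isn't 3.5 pies." this yields word-like tokens, because spaces delimit the words. For 猪肉 it yields the two characters 猪 and 肉 separately, since nothing in the typography marks the word boundary; deciding whether 猪肉 is one unit is the task of word segmentation.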
3.2.7 Word segmentation ambiguity
A text fragment for which at least two different word sequences over it can be found by string
matching with a pre-defined word-list.
NOTE: Word segmentation ambiguities can affect the accuracy of a word segmentation
program. Their resolution is difficult for computers but theoretically not a problem for human
annotators. It is a basic concern of word segmentation algorithms, thus beyond the scope of
this Standard.
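The definition above can be illustrated by enumerating every word sequence that string matching against a word-list allows for a fragment. The function names and the toy word-list are assumptions; the fragment 研究生命 is an illustrative example that does not come from this Standard. The sketch only detects ambiguity and does not resolve it.

```python
def all_segmentations(fragment, word_list):
    """Return every split of fragment into consecutive word-list entries."""
    if not fragment:
        return [[]]
    results = []
    for end in range(1, len(fragment) + 1):
        head = fragment[:end]
        if head in word_list:
            for tail in all_segmentations(fragment[end:], word_list):
                results.append([head] + tail)
    return results


def is_ambiguous(fragment, word_list):
    """A fragment is ambiguous if at least two word sequences cover it."""
    return len(all_segmentations(fragment, word_list)) >= 2


# Toy word-list: 研究 (research), 研究生 (graduate student), 生命 (life), 命 (fate).
WORD_LIST = {"研究", "研究生", "生命", "命"}
```

With this word-list, the fragment 研究生命 can be covered as 研究 + 生命 or as 研究生 + 命, so it counts as a word segmentation ambiguity.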
3.2.8 Unknown word
Out-of-word-list words that appear in the text being processed by a word segmentation
program.
NOTE: Unknown words can significantly affect the accuracy of a word segmentation program.
Their resolution is difficult for computers but theoretically not a problem for human annotators.
It is a basic concern of word segmentation algorithms, thus beyond the scope of this Standard.
3.2.9 Morpho-syntactic category
A linguistic category which links inflexions to syntactic categories, e.g. PERSON, GENDER,
NUMBER, TENSE, ASPECT, VOICE.
3.2.10 Orthography
The study of correct spelling according to established usage in a language.
3.2.11 Transcription
1
The process and result of representing speech sounds in phonetic symbols in a systematic
and consistent way. More than other systems, the IPA can be used most successfully as a
transcription language.
2
The process and result of conveying the sounds of a word in the source language by
letters/characters in the target language.
EXAMPLE: [Transcription1] Chinese is transcribed according to Pinyin system, and Japanese
according to romaji system.
3.2.12 Transliteration
The process and result of conversion of one writing system into another by converting each
character of the source language into a character of the target language.
NOTE: Transcription2 and transliteration are commonly done with proper nouns, such as the
names of people, places, and institutions.
3.2.13 Romanization
The representation of words written in a non-Latin script by means of the Latin alphabet, either
through transliteration or transcription.
EXAMPLE: Chinese(Pinyin), and Japanese(romaji).
NOTE: Romanization, orthography and word segmentation are three related factors in
languages that need word segmentation. In general, the correct Romanization, and
consequently the correct orthography, of a text depends heavily on the correct word
segmentation of that text.
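The dependence of Romanization on word segmentation can be sketched for Pinyin, where the syllables of one word are written together and words are separated by spaces. The table PINYIN and the function romanize are hypothetical, and the sentence 我吃猪肉 ("I eat pork") is an illustrative input that does not come from this Standard.

```python
# Hypothetical character-to-Pinyin table covering only this example.
PINYIN = {"我": "wǒ", "吃": "chī", "猪": "zhū", "肉": "ròu"}


def romanize(units):
    """Join syllables within each segmented unit; separate units by spaces."""
    return " ".join("".join(PINYIN[ch] for ch in unit) for unit in units)
```

Segmenting the sentence as 我 / 吃 / 猪肉 gives "wǒ chī zhūròu", whereas segmenting it as 我 / 吃 / 猪 / 肉 gives "wǒ chī zhū ròu", so the romanized spelling changes with the segmentation.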
4
General principles and methodologies
4.1 Principles in applying this Standard to the text
4.1.1 Principle of full coverage
The standard should be applicable to any text that needs word segmentation.
4.1.2 Principle of consistency
The standard should be applied in a consistent way to any text, and the output of applying the
standard should also be consistent.
4.2 The universal principle of morphology
All languages have words and all languages have morphemes.
4.3 Principles for validating the word-hood of a linguistic unit
4.3.1 Principles from the linguistic perspective
In general, all the linguistic principles regarding word-formation hold.
(1) Principle of bound morpheme: If a bound morpheme is attached to a word, then the result
is a word.
(2) Principle of lexical integrity hypothesis: syntactic rules may not refer to the internal structure
of words. If a linguistic unit satisfies this principle, then it tends to be a word.
(3) Principle of unpredictability of a word meaning from its subparts: If a linguistic unit has the
property of semantic unpredictability, it is considered a word (or more precisely, a lexical
item).
(4) Principle of idiomatization: If a linguistic unit has the property of idiomatization, then it is
considered a lexical item.
(5) Principle of collocation: If a linguistic unit has the property of collocation, then it is
considered a lexical item.
(6) Principle of unproductivity: If a linguistic unit shows very poor productivity, then it tends to
be a lexical item.
4.3.2 Principles from the practical (pragmatic) perspective
(1) Principle of frequency: frequency is a basic index for the degree of lexicalization of a
linguistic unit.
(2) Gestalt principle in cognitive linguistics: Things are likely to be perceived as a whole. This
principle provides evidence for the possibility of including phrasal compounds in the
lexicon even though they appear to be free combinations of component parts which are
themselves words.
(3) Principle of prototype members in categories: According to the prototype theory of the
mental lexicon, prototype members of categories are more salient than non-prototype
members, are more accurately remembered in short-term memory, and are more easily
retained and accessed in long-term memory by human beings. This principle provides a
rationale for including phrasal compounds which can serve as prototypes of a productive
word-formation pattern, like apple pie in English and 猪肉(pork) in Chinese, where the
patterns are “fruit + pie” and “animal + meat” respectively.
(4) Principle of language economy: For a linguistic unit, if its inclusion in the lexicon can
decrease the difficulty of later linguistic analysis, then it is considered a lexical item, e.g.
大中小学 in Chinese (“university, middle school, and primary school”).
4.4 The full entry principle of lexicon
All the words which ‘exist’ are listed in the lexicon. The lexicon should be dynamic, being
adapted to the changes of language usage.
4.5 Principle for word segmentation output
(1) Principle of granularity: words identified in the text by word segmentation may have inner
structures; the output is not simply the text with spaces inserted between words. The word
structures generated allow flexibility in word segmentation granularity, and thus flexibility in
word segmentation.
(2) Principle of the scope maximization of affixation
(3) Principle of the scope maximization of compounding: with respect to a lexicon
(4) Principle of the segmentation for punctuation, foreign character strings, word components
and miscellaneous items appearing in the text.
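The principle of granularity can be pictured as segmentation output in which each unit may carry its inner structure, so that a coarser or finer view can be read from the same result. The class Unit and the functions coarse and fine are hypothetical names for this sketch, and the sentence 我吃猪肉 ("I eat pork") is an illustrative input; this Standard does not prescribe this representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Unit:
    text: str
    parts: List["Unit"] = field(default_factory=list)  # inner word structure, if any


def coarse(units):
    """Coarse granularity: one entry per top-level unit."""
    return [u.text for u in units]


def fine(units):
    """Fine granularity: recursively expand units that have inner structure."""
    out = []
    for u in units:
        out.extend(fine(u.parts) if u.parts else [u.text])
    return out


# 猪肉 is kept as one unit but records its two constituent words.
SENTENCE = [Unit("我"), Unit("吃"), Unit("猪肉", [Unit("猪"), Unit("肉")])]
```

Both views cover the same text: the coarse view keeps the phrasal compound 猪肉 whole, while the fine view exposes 猪 and 肉, so an application can choose the granularity it needs without re-segmenting.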
4.6 Methodology in designing word segmentation
4.6.1 General architecture for word segmentation
The following components should be well defined and carefully prepared before performing
word segmentation.
(1) a lexicon with high coverage of texts and, where applicable, with morphological structures
for some lemmas
(2) word formation specification: covering derivation, compounding and reduplication
(3) a complete prefix/semi-prefix list
(4) a complete suffix/semi-suffix list
(5) a complete free morpheme list
(6) a complete bound morpheme list
(7) special morpheme lists that have special functions in the process of word segmentation, for
example, inflectional affix for verbs in Japanese.
(8) corpus: to support the quantitative analysis of the lexicon (but not as a part of the
Standards)
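As a rough sketch, components (1) to (7) above could be gathered in one container so that a system can check what is still unprepared before segmentation starts. The class SegmentationResources, its field names and types, and the method missing are all assumptions made for illustration; the corpus of item (8) is left out because it supports analysis of the lexicon rather than segmentation itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SegmentationResources:
    # (1) lexicon: lemma -> morphological structure (empty list if none)
    lexicon: Dict[str, List[str]] = field(default_factory=dict)
    # (2) word formation specification, here just rule descriptions
    word_formation_rules: List[str] = field(default_factory=list)
    # (3) and (4) prefix/semi-prefix and suffix/semi-suffix lists
    prefixes: Set[str] = field(default_factory=set)
    suffixes: Set[str] = field(default_factory=set)
    # (5) and (6) free and bound morpheme lists
    free_morphemes: Set[str] = field(default_factory=set)
    bound_morphemes: Set[str] = field(default_factory=set)
    # (7) special morpheme lists, keyed by their function
    special_morphemes: Dict[str, Set[str]] = field(default_factory=dict)

    def missing(self) -> List[str]:
        """Names of components that are still empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]
```

A caller could refuse to run segmentation while missing() is non-empty, or merely log the gaps; which components are truly required depends on the language.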
4.6.2 The role and makeup of the lexicon
(1) The lexicon serves as a gold standard in word segmentation so as to maintain consistency
in word segmentation to the maximum extent.
(2) For words, only lexemes, rather than word forms, are included in the lexicon.
(3) Monography in the lexicon should be removed.