© ISO 2005 – All rights reserved

ISO/TC 37/SC 4 AWI N309
Date: 2006-08-18
ISO/AWI N309
ISO/TC 37/SC 4/WG 2
Secretariat: Key-Sun, Choi

Language resource management - Word segmentation of written texts for mono-lingual and multi-lingual information processing - Part 1: General principles and methods

Warning
This document is not an ISO International Standard. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an International Standard. Recipients of this draft are invited to submit, with their comments, notification of any relevant patent rights of which they are aware and to provide supporting documentation.

Document type: International standard
Document subtype: if applicable
Document stage: (9) Preparation
Document language: E

Copyright notice
This ISO document is a working draft or committee draft and is copyright-protected by ISO. While the reproduction of working drafts or committee drafts in any form for use by participants in the ISO standards development process is permitted without prior permission from ISO, neither this document nor any extract from it may be reproduced, stored or transmitted in any form for any other purpose without prior written permission from ISO. Requests for permission to reproduce this document for the purpose of selling it should be addressed as shown below or to ISO’s member body in the country of the requester:
[Indicate: the full address, telephone number, fax number, telex number and electronic mail address, as appropriate, of the Copyright Manager of the ISO member body responsible for the secretariat of the TC or SC within the framework of which the draft has been prepared]
Reproduction for sales purposes may be subject to royalty payments or a licensing agreement. Violators may be prosecuted.
1 Scope

The word segmentation international standard series (Part 1, Part 2 and Part 3) targets any natural language in which the word boundaries of written text cannot be fully identified by typographic properties (such as spaces in English), for example Chinese, Japanese, Korean, Thai, Vietnamese, Mongolian, and Tibetan. These Standards concern what the output should be for any input text after the process of word segmentation, pursuing consistency in word segmentation within and among texts to the maximum extent, so as to meet the requirements of a variety of applications in language information processing, both mono-lingual and multi-lingual. The applications include, but are not limited to, natural language processing, information retrieval, search engines, question answering, machine translation and machine-aided translation, pre-processing for text-to-speech, post-processing for speech recognition, OCR and other character input methods, proofreading, digital libraries, terminology and ontology, the semantic web, eBusiness and eCommerce, content management, and natural-language-based computer-aided eLearning (including language learning and second language learning). They shall also be helpful for orthographic processing (Romanization) of text in some languages such as Chinese. The Standards shall not account for word segmentation algorithms, though all the factors considered here, the lexicon for example, are necessary for algorithm design and implementation. The Standard presented here is Part 1 of the word segmentation standard series, with emphasis on the general principles and methods of word segmentation.
The Standard should be used in close conjunction with ISO 16642:2003, Terminology Markup Framework, with ISO 12620, Terminology and other language resources – Data categories for electronic lexical resources (DCR), with ISO WD 24613:2004, Language resource management – Lexical markup framework (LMF), and with ISO WD *** Morphosyntactic Annotation Framework, particularly in the representation of lexical items and word segmentation output.

2 Normative references

The following normative documents contain provisions of this Standard. It should be noted that, generally, the definitions of the related concepts given in these documents apply in this Standard, though the definitions given here take priority if there is some degree of inconsistency between the normative documents and this Standard.

ISO 639-1:2002, Codes for the representation of names of languages – Part 1: Alpha-2 Code
ISO 639-2:1998, Code for the representation of languages – Part 2: Alpha-3 Code
ISO 639-3:200?, Codes for the representation of languages – Part 3: Alpha-3 Code for the comprehensive coverage of languages
ISO 704:2000, Terminology work – Principles and methods
ISO 860:1996, Terminology work – Harmonization of concepts and terms
ISO 1087-1:2000, Terminology – Vocabulary – Part 1: Theory and application
ISO 1087-2:1999, Terminology – Vocabulary – Part 2: Computer application
ISO/IEC 10646-1:2003, Information technology – Universal Multiple-Octet Coded Character Set (UCS)
ISO/IEC 11179-3:2003, Information Technology – Data management and interchange – Metadata Registries (MDR) – Part 3: Registry Metamodel (MDR3)
ISO 12620:1999, Computer applications in terminology – Data categories
ISO 16642:2003, Computer applications in terminology – TMF (Terminological Markup Framework)

3 Terms and definitions

For the purpose of clarity, two sets of terms are defined in this Standard: core terms and peripheral terms.
The core terms are necessary for this Standard, while the peripheral terms are not necessary but are closely related to the context of this Standard.

3.1 Core terms

3.1.1 Morphology
The study of the structure and formation of words. In general, there are two sub-types of morphology, lexical morphology and inflectional morphology.

3.1.2 Word
A basic grammatical unit of language, and a relatively independent carrier of meaning, that can stand alone to make up sentences. The unit is basically at the morphological level of a language and is intuitively and mentally available to native speakers. A word is abstracted to a lexeme in the lexicon, with at least a part of speech. A word may consist of a single morpheme or of a combination of morphemes.

3.1.3 Lexeme
A basic abstract unit of the lexicon which is concretely realized in word forms. A lexeme may also be a part of another lexeme in terms of word formation, such as derivation and compounding. Free morphemes are the simplest lexemes. In this Standard, lexeme is used synonymously with word. The citation form of a lexeme, or lemma, is the word form belonging to the lexeme which is conventionally chosen to represent the lexeme.
EXAMPLE: [English] find, found, and finding are word forms of the lexeme FIND (in writing here, lexemes are generally distinguished by the use of capital letters), and find is the citation form, or lemma, for FIND. [Japanese] ******. [Korean] ******.

3.1.4 Lexicon
The totality of all the established words (or more precisely, lexical items) of a language, seen either as a list or as a structured whole, with information given for each lexical item, usually including pronunciation, meaning, morphological properties and syntactic properties. The lexicon is an abstract linguistic entity. It can be organized as a dictionary, usually an electronic dictionary in computers (either machine readable or machine tractable), or as the mental lexicon in a native speaker’s mind.
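The relation between a lexeme, its lemma and its word forms (3.1.3), and the lexicon as a collection of such entries (3.1.4), can be pictured as a small data structure. The following is only an illustrative sketch: the class name `Lexeme`, its field names and the dictionary `lexicon` are assumptions made for this example and are not defined by this Standard.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Lexeme:
    """An abstract lexical unit that is realized in text by its word forms."""
    name: str            # capitalised label for the lexeme, e.g. "FIND"
    lemma: str           # citation form conventionally chosen to represent it
    part_of_speech: str  # a word carries at least a part of speech
    word_forms: frozenset = field(default_factory=frozenset)

    def realizes(self, form: str) -> bool:
        """Return True if `form` is one of the listed word forms."""
        return form in self.word_forms


# The English example from 3.1.3: find, found and finding realize FIND.
FIND = Lexeme("FIND", "find", "verb", frozenset({"find", "found", "finding"}))

# A toy lexicon keyed by lemma (hypothetical layout, one entry only).
lexicon = {FIND.lemma: FIND}
```

Keying the toy lexicon by lemma mirrors the statement that the lemma is the form chosen to represent the lexeme; a real electronic dictionary would carry far more information per entry.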
3.1.5 Dictionary
A concrete realization of (a part of) the lexicon, usually an electronic dictionary in the context of this Standard. The collection of lexical items in a dictionary is called its word-list.
NOTE: A dictionary in general can never provide full coverage of the lexicon, due to the practical limitations of size and the requirements of language users, as well as the morphological complexity of the language.

3.1.6 Mental lexicon
The mental representation of lexical knowledge in the brain of the individual language user.
NOTE: The mental lexicon of an individual is always smaller than the lexicon.

3.1.7 Lemma (also called Lexical item)
In this Standard, lemmas refer to all the lexical items listed in the dictionary, including not only words strictly defined in linguistics, but also fixed expressions (phrasal words, idioms, technical terms), and possibly proverbs and familiar quotations. In the latter case, the lemma for a lexical item is derived from the lemmas of its constituent elements, if applicable.
NOTE: The lemma for (He) kicked the bucket (in the accident) is kick the bucket.

3.1.8 Word forms
The concretely realized grammatical forms of a word, or equivalently, of a lexeme, according to its grammatical categories in the context of a sentence. Word forms in the spoken language can be transcribed. Word forms in the written language can be termed orthographic words. The concept of word forms can be extended to any lexical item, in that any linguistically valid combination of the word forms of its constituent elements is regarded as a word form of that lexical item.
NOTE: The concept of word forms plays a key role in inflecting and agglutinating languages, like English and Japanese, but is in general not useful for isolating languages, like Chinese and Vietnamese.
3.1.9 Morpheme
The smallest meaningful element of language that, as a basic phonological and semantic element at a level lower than the word, cannot be reduced into smaller elements. Or briefly, a grammatical unit which is used to constitute words. The morpheme is an abstract unit. It is represented phonetically and phonologically by, or realized in, morphs. Morphemes can be classified into two sub-types, free morpheme and bound morpheme.

3.1.10 Free morpheme
Morphemes that can stand by themselves.
EXAMPLE: [Chinese] 猪(pig), a single-character word, is a free morpheme. [Japanese] ******. [Korean] ******.

3.1.11 Bound morpheme
Morphemes that appear only together with other morphemes to form a lexeme.
EXAMPLE: [Chinese] 伟 is a bound morpheme: it means great by its character meaning, but cannot function as a word in text. Instead, it can be used as a constituent part of many words, such as 伟大(great), 伟人(giant) and 雄伟(majesty). [Japanese] ******. [Korean] ******.

3.1.12 Morph
The constituent element of a word form. It is a concrete realization of a morpheme.

3.1.13 Realization
In the sense used here, ‘realize’ means ‘to make real’. Abstract entities are realized by entities which have a form. So word forms realize lexemes, morphs realize morphemes, and a dictionary realizes a lexicon.

3.1.14 Lemmatization
The process of determining the lemma for a given word form in the context of a sentence, usually accompanied by determining the part of speech of that word.
EXAMPLE: [English] The lemma for finding is determined as find by lemmatization. [Japanese] ******. [Korean] ******.
NOTE: Lemmatization in general does not make sense for isolating languages, such as Chinese.

3.1.15 Root
A root is that part of a word form which remains when all inflectional and derivational affixes have been removed. It is a basic part of a lexeme and cannot be further analyzed into smaller morphs.
EXAMPLE: [English] The root in the word form destabilized is stabil-, obtained by removing the derivational affixes de- and -ize, as well as the inflectional suffix -(e)d. [Japanese] ******. [Korean] ******.

3.1.16 Stem
A stem is that part of a word form which remains when all inflectional affixes have been removed. A stem consists minimally of a root, but may be analyzable into one or many roots, together with the associated derivational affixes. If a stem does not occur by itself in a meaningful way in a language, it is referred to as a bound morpheme.
EXAMPLE: [English] The stem for the word form destabilized is destabilize, which includes the root stabil- and the derivational affixes de- and -ize, but not the inflectional suffix -(e)d. [Japanese] ******. [Korean] ******.
NOTE: In English, the lemma for a word form always uses its stem. The difference is that lemmatization of isn’t leads to two lemmas, is and not, whereas no operation concerning the stem is applicable to isn’t.

3.1.17 Part of speech (also called Lexical category, or Word class)
In grammar, the part of speech of a word (or more precisely, lexical item) is defined as the role that the word plays in a sentence, according to its particular syntactic or morphological behavior.

3.1.18 Lexical morphology
A branch of morphology that deals with word formation.

3.1.19 Inflectional morphology
A branch of morphology that deals with inflection.

3.1.20 Word formation
The creation/building of words in a language. The greatest part of word formation can be subsumed under the processes of derivation, compounding, abbreviation and borrowing.

3.1.21 Inflection
The process of adding inflectional affixes to a lexeme to create its word forms. Inflection is a grammatical process, rather than a lexical process.

3.1.22 Affix
A bound morpheme which does not realize a lexeme and may be added to a stem. Functionally, affixes can be of two categories, inflectional or derivational.
Affixes can also be classified into three main sub-types according to their placement on the stem: prefix, suffix and infix.

3.1.23 Inflectional affix
Affixes that can produce the word forms of a lexeme.

3.1.24 Derivational affix
Affixes that can produce a new lexeme from an existing lexeme.

3.1.25 Affixation
A process in which an affix is added to a stem.

3.1.26 Prefix
A sub-type of affix that is placed before the stem.

3.1.27 Prefixation
A process of affixation in which a prefix is added to the beginning of a stem.

3.1.28 Suffix
A sub-type of affix that is placed after the stem.

3.1.29 Suffixation
A process of affixation in which a suffix is added to the end of a stem.

3.1.30 Infix
A sub-type of affix that is inserted into a stem.

3.1.31 Infixation
A process of affixation in which an infix is inserted into a stem.

3.1.32 Derivation
A process of word formation in which a derivational affix is added to a lexeme to create a new lexeme. Derivation is a lexical process.

3.1.33 Conversion (also called Zero derivation)
A process of word formation in which a word is created from an existing word without any change in form, but often with a change in part of speech.

3.1.34 Reduplication
A morphological process in which the entire word, or part of it, is repeated. Reduplication is used both in inflection to convey a grammatical function, and in lexical derivation to create new words. The position of reduplication may be initial, final, or internal. It can in some cases be viewed as a special way of making affixes, both inflectional and derivational.

3.1.35 Compounding (also called Composition)
A process of word formation in which new lexemes are formed by adjoining two or more lexemes. Compounding is a lexical process.
NOTE: Compounding should not be confused with derivation, where bound morphemes are added to free ones.

3.1.36 Compound
The result of the process of compounding. A compound may be endocentric if it has a head, i.e.
the fundamental part that contains the basic meaning of the whole compound, and modifiers, which restrict this meaning, or exocentric if it does not have a head. A compound can also be rather long. There are two main sub-types of compounds according to their degree of lexicalization, word-compound and phrasal compound.

3.1.37 Word-compound
A compound whose overall meaning is often not predictable from its constituent parts. A word-compound is a kind of word strictly defined in linguistics.

3.1.38 Phrasal compound
A compound that is used steadily and frequently in the language, even if its overall meaning is predictable from its constituent parts (it might thus be thought of as a phrase by some linguists).
EXAMPLE: [English] Apple pie. [Chinese] 猪肉(pork) is composed of two single-character words, 猪(pig) and 肉(meat), and is thus a phrasal compound. [Japanese] ******. [Korean] ******.
NOTE: In practice, there is neither a clear cut between word-compound and phrasal compound nor a clear cut between phrasal compound and phrase, due to the fuzziness in semantic predictability and in the degree of lexicalization, although theoretically the cut can be very clear. Lexico-statistics, word frequency in particular, will play an important role in this respect.

3.1.39 Multi-word expression
An expression composed of an ordered group of words which can stand independently and is used steadily and frequently in the language. Multi-word expressions include compounds (both word-compounds and phrasal compounds), fixed expressions and technical terms.

3.1.40 Fixed expression
A multi-word expression whose constituent elements cannot be moved randomly or substituted without distorting the overall meaning or allowing a literal interpretation. Fixed expressions range from word-compounds and collocations to idioms. Some proverbs and even familiar quotations can also be considered fixed expressions if they are used steadily and frequently in the language.
3.1.41 Idiom
A fixed expression whose overall meaning is not always transparent from the combination of the meanings of its constituent elements. Frequently for an idiom, there is a diachronic connection between the literal reading and the idiomatic reading. Some of these connections can only be reconstructed through historical knowledge.

3.1.42 Colloquial expression
A sub-type of idiom which is normally used in the spoken language and considered more informal than written discourse.

3.1.43 Collocation
A fixed expression concerning the semantic compatibility of two or more grammatically adjacent items (words, in most cases), in which an idiomatic relation has developed to some degree, as evidenced by a habitual co-occurrence between these items. The co-occurrence patterns may be syntactic or morphological. Collocations which are adjacent in text can be treated as lexical items. Collocations of this kind are more fixed than free word combinations and less fixed than idioms.

3.1.44 Technical term
Terms used in a specific subject or domain. Technical terms of more than two words are generally viewed as multi-word expressions.

3.1.45 Abbreviation
A process and result of word formation in which a shortened form of a word, phrase or term, representing its full form, is created by omitting words or letters/characters from the full form, for the sake of brevity. Abbreviation is a lexical process.

3.1.46 Borrowing
A process of word formation in which a linguistic expression is borrowed from one language into another language, usually when no term exists for the new object or concept. Among the causes of such cross-linguistic influence may be various political, cultural, social, or economic developments.

3.1.47 Loan word
Words borrowed from one language into another language, which have become lexicalized (or assimilated phonetically, graphemically, or grammatically) into the new language as the result of borrowing.
3.1.48 Proper noun (also called Proper name)
The names of unique entities, for example, persons, places, or organizations.

3.1.49 Word structure (also called Morphological structure of word)
The structure obtained by applying morphological analysis to a word (more precisely, a lexical item) in the lexicon, or to a word form in text. Some lexical items can be analyzed into structures whereas others are not analyzable, according to the morphology of the language.

3.1.50 Word segmentation
A process that performs morphological analysis on any input text of a language in the scope of this Standard, with the identified word boundaries as its main output. It serves as the first step in all the related NLP-oriented information processing systems for these languages.

3.1.51 Word segmentation unit (WSU)
A computation-oriented morphological unit that includes: (1) all the lexical items in the lexicon; and (2) all the word forms, numeric strings, foreign character strings, word components (e.g., bound morphemes), and miscellaneous items that may appear in text.

3.1.52 Corpus
A collection of texts concerning actual language use, usually electronically stored and processed.

3.1.53 Representative corpus of a language
A sufficiently large and well-balanced corpus which is appropriate for depicting the whole picture of language use.

3.1.54 Domain specific corpus
A corpus of a specific subject or domain which can reflect the language use in that subject or domain.

3.1.55 Raw corpus
A corpus without any linguistic processing or annotation.

3.1.56 Annotated corpus
A corpus with high quality linguistic annotation at a certain linguistic level. An annotated corpus at the level of word segmentation is an important resource for this Standard.

3.1.57 Type
Linguistic unit in a text or corpus representing a defined class.

3.1.58 Token
Occurrence of a type in a text or corpus.
NOTE: If the class is defined as all word forms of a lexeme, then the linguistic unit in this setting is called a word type, and all the occurrences of the word forms are called word tokens of this word type.

3.1.59 Lexico-statistics
A variety of statistics that are helpful for the quantitative study of the morphology of a language, such as frequency, mutual information, and chi-square.

3.1.60 Frequency
The number of occurrences of a type in a text or corpus. Frequencies are established by means of frequency counts on the basis of a corpus. In general, an adequate estimate of frequencies can be derived directly from the representative corpus of the language. An improved estimate might be obtained by considering a variety of related factors, for example, the frequency distribution over domain specific corpora, different genres, and different periods of time. Evidence of this kind can help in the selection of the word-list, particularly in quantifying the degree of lexicalization.
NOTE: If the type is defined as word type, then what we get is called word frequency. The corpus for this purpose should be annotated in advance at the word segmentation level.

3.1.61 Lexicalization
The process of making a word to express a concept. A possible word is said to be lexicalized if it has become an established word. Possible words may be lexicalized if their meaning is no longer the sum of the meanings of their parts, or if they are unproductive in formation. They may also be lexicalized in other ways, as with phrasal compounds, even when their formation is quite productive or they lack semantic idiosyncrasy. In the above cases, the degree of lexicalization varies from high to low, with full semantic idiosyncrasy as one extreme (high) and the phrasal compound as the other extreme (low).
EXAMPLE: [English] The degree of lexicalization for honeymoon is high, and that for apple pie is low.

3.1.62 Homograph
A type of lexical ambiguity.
Two lexical items are homographic if they are orthographically identical but have different meanings.

3.2 Peripheral terms

3.2.1 Grapheme
The smallest distinctive unit in the writing system of a language. There are two major types of writing systems: non-phonological systems (pictographic, ideographic, cuneiform, and logographic), and phonological systems (syllabic, alphabetic). A grapheme often represents a morpheme or a whole word in non-phonological systems, whereas it represents a phoneme or a syllable in phonological systems. Variants of any given grapheme are called allographs.
NOTE: In non-phonological systems (e.g. Chinese, Japanese Kanji) and syllabic phonological systems (e.g. Japanese kana, Korean Hangul), a grapheme is conventionally called a character, or a character component in particular cases (e.g. Korean Hangul), whereas in alphabetic phonological systems (e.g. English), a grapheme is conventionally called a letter. The number of graphemes in phonological systems usually ranges from 20-30 to several dozen, and that in non-phonological systems is usually several thousand or more.

3.2.2 Character
A character is any grapheme (including the so-called letter and character in writing systems), number, space, punctuation mark, or other symbol that can be processed in a computer. The list of characters, or character set, is defined by ISO/IEC 10646.

3.2.3 Syllable
Basic phonetic-phonological unit of the word or of speech that can be identified intuitively.
NOTE: Most Chinese characters are monosyllabic and monomorphemic.

3.2.4 Delimiter
One or more characters used to indicate the beginning or the end of a character string.

3.2.5 Stemming
The process, carried out by a computer program, of determining a stem of a given word form. It is usually sufficient that related words are mapped to the same stem (in fact, to the same root rather than a stem in some cases), even if the stem is not fully linguistically valid.
NOTE: Stemming can be regarded as an approximation of lemmatization.
Stemming operates on a single word without knowledge of the context, so its performance is not ideal, but it is easier to implement and runs faster, and the reduced accuracy may not matter for some applications, such as information retrieval.

3.2.6 Tokenization
An operation of splitting up a string of characters into a set of tokens, according to how tokens are defined. It is applicable to text in natural language as well as text in artificial language. If the object is natural language text that needs word segmentation, and tokens are defined as word forms, then the task of tokenization in this setting can be thought of as almost the same as the task of word segmentation.
NOTE: Tokenization is applicable to any language, English for example, not just to languages that need word segmentation, though the task of tokenization may vary from case to case.

3.2.7 Word segmentation ambiguity
A text fragment for which at least two different word sequences over it can be found by string matching with a pre-defined word-list.
NOTE: Word segmentation ambiguities can affect the accuracy of a word segmentation program. Their resolution is difficult for computers but theoretically not a problem for human annotators. It is a basic concern of word segmentation algorithms, and thus beyond the scope of this Standard.

3.2.8 Unknown word
Out-of-word-list words that appear in the text being processed by a word segmentation program.
NOTE: Unknown words can significantly affect the accuracy of a word segmentation program. Their resolution is difficult for computers but theoretically not a problem for human annotators. It is a basic concern of word segmentation algorithms, and thus beyond the scope of this Standard.

3.2.9 Morpho-syntactic category
A linguistic category which links inflections to syntactic categories, e.g. PERSON, GENDER, NUMBER, TENSE, ASPECT, VOICE.

3.2.10 Orthography
The study of correct spelling according to established usage in a language.
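As an illustration of definitions 3.2.7 and 3.2.8, the following minimal sketch enumerates every word sequence that string matching against a word-list can find over a text fragment. It is not an algorithm prescribed by this Standard, which leaves segmentation algorithms out of scope. The function names `segmentations` and `is_ambiguous`, and the four-item word-list, are assumptions made for this example only.

```python
def segmentations(text, word_list):
    """Return every way to split `text` into consecutive items of `word_list`.

    Plain exhaustive string matching; the number of results can grow
    exponentially with the length of the text, so this is for illustration.
    """
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in word_list:
            for rest in segmentations(text[end:], word_list):
                results.append([head] + rest)
    return results


def is_ambiguous(text, word_list):
    """A fragment is ambiguous if at least two word sequences cover it."""
    return len(segmentations(text, word_list)) >= 2


# Hypothetical word-list: 结 (knot), 结合 (combine), 合成 (synthesize), 成 (become).
WORD_LIST = {"结", "结合", "合成", "成"}

print(segmentations("结合成", WORD_LIST))  # two competing word sequences
print(segmentations("猪", WORD_LIST))      # no sequence: an unknown word
```

The fragment 结合成 can be read as 结 + 合成 or as 结合 + 成, so it is a word segmentation ambiguity with respect to this word-list. The fragment 猪 has no covering sequence because it is out of the word-list, which corresponds to the unknown word case.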
3.2.11 Transcription
1 The process and result of representing speech sounds in phonetic symbols in a systematic and consistent way. The IPA can be used more successfully than other systems as a transcription language.
2 The process and result of conveying the sounds of a word in the source language by letters/characters in the target language.
EXAMPLE: [Transcription1] Chinese is transcribed according to the Pinyin system, and Japanese according to the romaji system.

3.2.12 Transliteration
The process and result of conversion of one writing system into another by converting each character of the source language into a character of the target language.
NOTE: Transcription2 and transliteration are commonly done with proper nouns, such as the names of people, places, and institutions.

3.2.13 Romanization
The representation of words written in a non-Latin script by means of the Latin alphabet, either through transliteration or transcription.
EXAMPLE: Chinese (Pinyin), and Japanese (romaji).
NOTE: Romanization, orthography and word segmentation are three related factors in languages that need word segmentation. In general, the correct Romanization, and consequently the correct orthography, of a text depends heavily on the correct word segmentation of that text.

4 General principles and methodologies

4.1 Principles in applying this Standard to the text

4.1.1 Principle of full coverage
The Standard should be applicable to any text that needs word segmentation.

4.1.2 Principle of consistency
The Standard should be applied in a consistent way to any text, and the output of applying the Standard should also be consistent.

4.2 The universal principle of morphology
All languages have words and all languages have morphemes.

4.3 Principles for validating the word-hood of a linguistic unit

4.3.1 Principles from the linguistic perspective
In general, all the linguistic principles regarding word formation hold.
(1) Principle of bound morpheme: If a bound morpheme is attached to a word, then the result is a word.
(2) Principle of lexical integrity hypothesis: Syntactic rules may not refer to the internal structure of words. If a linguistic unit satisfies this principle, then it tends to be a word.
(3) Principle of unpredictability of a word meaning from its subparts: If a linguistic unit has the property of semantic unpredictability, it is considered a word (or more precisely, a lexical item).
(4) Principle of idiomatization: If a linguistic unit has the property of idiomatization, then it is considered a lexical item.
(5) Principle of collocation: If a linguistic unit has the property of collocation, then it is considered a lexical item.
(6) Principle of unproductivity: If a linguistic unit shows very poor productivity, then it tends to be a lexical item.

4.3.2 Principles from the practical (pragmatic) perspective
(1) Principle of frequency: Frequency is a basic index of the degree of lexicalization of a linguistic unit.
(2) Gestalt principle in cognitive linguistics: Things are likely to be perceived as a whole. This principle gives evidence for the possibility of including phrasal compounds in the lexicon even though they seem to be free combinations of component parts which are themselves words.
(3) Principle of prototype members in categories: According to the prototype theory of the mental lexicon, prototype members of categories are more salient than non-prototype members, are more accurately remembered in short-term memory, and are more easily retained and accessed in long-term memory by human beings. This principle provides a rationale for including phrasal compounds which can serve as prototypes of a productive word-formation pattern, like apple pie in English and 猪肉(pork) in Chinese, where the patterns are “fruit + pie” and “animal + meat” respectively.
(4) Principle of language economy: For a linguistic unit, if its inclusion in the lexicon can decrease the difficulty of later linguistic analysis, then it is considered a lexical item, e.g., 大中小学 in Chinese (“university, middle school, and primary school”).

4.4 The full entry principle of lexicon
All the words which ‘exist’ are listed in the lexicon. The lexicon should be dynamic, adapting to changes in language usage.

4.5 Principle for word segmentation output
(1) Principle of granularity: Words identified in the text after word segmentation may have inner structures; segmentation is not simply a matter of inserting spaces between words. The word structures generated add a degree of flexibility to the granularity of word segmentation, and thus to word segmentation itself.
(2) Principle of the scope maximization of affixation
(3) Principle of the scope maximization of compounding: with respect to a lexicon
(4) Principle of the segmentation for punctuation, foreign character strings, word components and miscellaneous items appearing in the text.

4.6 Methodology in designing word segmentation

4.6.1 General architecture for word segmentation
The following components should be well defined and carefully prepared before performing word segmentation.
(1) a lexicon with high coverage of texts and, where applicable, with morphological structures for some lemmas
(2) a word formation specification covering derivation, compounding and reduplication
(3) a complete prefix/semi-prefix list
(4) a complete suffix/semi-suffix list
(5) a complete free morpheme list
(6) a complete bound morpheme list
(7) special morpheme lists that have special functions in the process of word segmentation, for example, inflectional affixes for verbs in Japanese
(8) a corpus, to support the quantitative analysis of the lexicon (but not as a part of the Standards)

4.6.2 The role and makeup of the lexicon
(1) The lexicon serves as a gold standard in word segmentation so as to keep word segmentation consistent to the maximum extent.
(2) For words, only lexemes, rather than word forms, are included in the lexicon.
(3) Monography in the lexicon should be removed.
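The quantitative analysis supported by the corpus component in 4.6.1 (8) starts from word frequency as defined in 3.1.60, counted over a corpus that is already annotated at the word segmentation level. The following is a minimal sketch under that assumption; the function name `word_frequencies` and the two-sentence toy corpus are hypothetical and serve only to show the type/token distinction.

```python
from collections import Counter


def word_frequencies(segmented_sentences):
    """Count word tokens per word type in an already segmented corpus.

    `segmented_sentences` is an iterable of sentences, each given as a
    list of words (the output of word segmentation).
    """
    counts = Counter()
    for sentence in segmented_sentences:
        counts.update(sentence)
    return counts


# Toy annotated corpus (hypothetical), one list of words per sentence.
corpus = [
    ["伟人", "吃", "猪肉"],
    ["猪", "不", "吃", "猪肉"],
]

freqs = word_frequencies(corpus)
token_count = sum(freqs.values())  # number of word tokens
type_count = len(freqs)            # number of distinct word types
relative = {word: n / token_count for word, n in freqs.items()}
```

Because 猪肉 is counted as one word type rather than as 猪 plus 肉, the counts depend directly on the segmentation, which is why the corpus must be segmented consistently before frequencies are used to judge the degree of lexicalization.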