Unicode: What is it and how do I use it? The rationale for Unicode and its design goals and detailed design principles are presented. The correspondence between Unicode and ISO/IEC 10646 is discussed, the scripts included or planned for inclusion in the two character set standards are listed. Some products that support Unicode and some applications that require Unicode are listed, then examples of how to specify Unicode characters in a variety of applications are given. Use of Unicode in SGML and XML applications is discussed, and the paper concludes with descriptions of the character encodings used with Unicode and ISO/IEC 10646, plus sources of further information are listed. Tony Graham, Senior Consultant, Mulberry Technologies, Inc. What is Unicode? Unicode is a character encoding standard published by, of all people, the Unicode Consortium. Unicode is designed to include all of the major scripts of the world in a simple and consistent manner. The Unicode Standard also defines properties of the characters and algorithms for use in Unicode implementations. Existing character encodings are a mixture of one-byte and two-byte encodings—Chinese, Japanese, and Korean encodings need two bytes per character, and most other coded character sets use only one byte per character. Most coded character sets are more or less compatible with ASCII for the first 127 character numbers, but above that it's anybody's game. Changes between character encodings are signaled by escape sequences embedded in the text stream, so to determine the encoding, and in some cases to determine how many bytes to read as a single character, you need to process the entire text stream up to that point. The net result is that one transmission error or one misstep in applying encoding-translation software can garble an entire document. Unicode was born from frustration by software manufacturers with the fragmented, complicated, and contradictory character encodings in use around the world. The technical difficulties of dealing with different coded character sets meant that software had to be extensively localised before being released into different markets, which meant that the “other language” versions of software could be significantly delayed and also be significantly different internally because of the character handling requirements. Not surprisingly, the same character handling changes were regrafted onto each new version of the base software before it could be released into other markets. Of course the character encoding is not the only aspect to be changed when localizing software, but it is a large component, and it is the aspect most likely to have a custom solution for each new language. The Unicode work began in the late 1980s, and the Unicode Consortium was incorporated as Unicode, Inc. in 1991. Today the Consortium's members include many major software and hardware manufacturers. The current version of the Unicode Standard is version 2.1.9, and version 3.0 will be published in January 2000. Version 2.1 was issued as a Technical Report that lists additions and corrections from version 2.0 of the Standard. That document is available as a 950-page book that defines Unicode and lists every character with an example glyph. As an example of the problems of handling the world's scripts, in order to display every glyph, the version 2.0 book required specialised formatting software from five different companies. XML/SGML Asia Pacific 99 1 Unicode: What is it and how do I use it? • Tony Graham Version 3.0 is set to be an even larger book, since it defines over 10,000 more characters than Version 2.1, plus the new version promises a revised structure and more details on implementing support for the Unicode Standard. Table 1 summarizes the changes in the character set between Version 2.1 and Version 3.0. Table 1 Comparison of version 2.1 and version 3.0 of the Unicode Standard Category Alphabetics, Symbols CJK Ideographs Hangul Syllables Total assigned characters Private Use Surrogates Controls Not Characters Total assigned 16-bit code values Unassigned 16-bit code values 2.1 6511 21204 11172 38887 6400 2048 65 2 47402 18134 3.0 10236 27786 11172 49194 6400 2048 65 2 57709 7827 The Unicode Standard defines a fixed-width 16-bit, uniform encoding scheme for written characters and text. The Unicode Standard, Version 2.0, is also code-for-code identical with ISO/IEC 10646-1:1993 (plus the first seven amendments), and Version 3.0 will be code-forcode identical with the forthcoming ISO/IEC 10646-1:2000. The Unicode Standard, however, does define more semantics for characters than does ISO/IEC 10646. The Unicode Technical Committee has a liaison membership with the ISO/IEC Working Group responsible for computer character sets. Unicode Design Goals The original design goals of the Unicode Standard were: Universal The repertoire had to be large enough to encompass all characters likely to be used in general text interchange. Efficient Plain text, composed of a sequence of fixed width characters, is simple to parse, and software does not need to maintain state, look for special escape sequences, or search forward or backward through text to identify characters. Uniform A fixed-length character code allows efficient sorting, searching, display, and editing of text. Unambiguous Any given 16-bit value always represents the same character. Unicode Design Principles Building on the design goals, the design of the Unicode Standard reflects the following ten design principles. 2 XML/SGML Asia Pacific 99 Tony Graham • Unicode: What is it and how do I use it? Sixteen-Bit Characters All Unicode character codes are a uniform 16 bits. For compatibility with existing computer systems that can't handle 16-bit characters, the Unicode Standard defines the UTF-8 and UTF-7 formats for lossless transformation between Unicode characters and 8-bit and 7-bit character sequences, respectively. Full Encoding Using 16-bit characters allows over 65,000 code positions that can be assigned to characters. This was expected to far exceed the expected requirements for all modern and most archaic languages. However, as of Unicode 3.0, there are less than 8,000 unassigned code positions, and Unicode 4.0 will assign characters outside the 65,000 code positions in the “Basic Multilingual Plane”. Characters, Not Glyphs Unicode is concerned with characters, not glyphs. Characters are the smallest component of a written language, and glyphs are the representation of the written or displayed character. There is often a one-to-many or many-to-one relationship between characters and glyphs. Even in English, there are many different glyphs that can represent a single character (Figure 1), fonts that contain many glyphs for the same character (Figure 2), and in high-quality typesetting, single-glyph ligatures used for sequences of, for example, “f” and “i” or “f” and “l” characters (Figure 3). Figure 1. Will the real “a” please step forward? Figure 2. Ampersands Figure 3. “ffi” and “ffi” ligature XML/SGML Asia Pacific 99 3 Unicode: What is it and how do I use it? • Tony Graham Semantics The Standard provides well-defined semantics for each character, including numeric, spacing, combination, and directionality properties. The specification of these semantics is not included in ISO/IEC 10646, and the writers of the Unicode Standard see this as a major feature of Unicode over ISO/IEC 10646. Plain Text Plain Unicode is a sequence of character codes. Unicode does not currently define any codes for specifying language, font, etc., although there are character codes that provide hints about directionality of the text. In Unicode terms, SGML and XML are “fancy text” or “higher-level protocols” since their markup represents additional data structures interspersed in the stream of plain Unicode characters. The “plain text” nature of Unicode is set to change since a recent Unicode Technical Report [TR7 98] proposes the addition of language tags using characters on Plane 14 of ISO/IEC 10646. Since the language tags cannot be confused for textual content, parsing for them is, in theory, easier, than, say, using SGML or XML, and they are seen as a lighter-weight mechanism for specifying language. Logical Order Characters are stored in their logical order: in the sequence in which they are read, which is not always the sequence in which they are displayed. For example, Arabic and Hebrew are read right-to-left. Since the logical start of the right-to-left text is the character closest to the right margin, that character is the first character in the Unicode character stream, as shown in Figure 4. Display order Example of "tfel ot thgir" text Logical order Example of "right to left" text Figure 4. Bidirectional Ordering The characters' directionality properties, and use of character codes specifying changes in direction when mixing characters of different dominant direction, provide sufficient information for correct rendering of the text. In addition, the Unicode Standard specifies that combining marks always follow their base character. In contrast, existing encodings are not standardised, and some require the combining characters before the base character and some after. Unification Unicode unifies characters within scripts across languages so that characters with equivalent form are given a single code. For example, common letters, punctuation marks, symbols, and diacritics were each given one code. In addition, over 120,000 ideographs used in Chinese, 4 XML/SGML Asia Pacific 99 Tony Graham • Unicode: What is it and how do I use it? Japanese, and Korean were unified to 21,204 character codes (expanding to 27,786 in Version 3.0). The exceptions are “compatibility characters” that could have been unified but remained as separate code positions, often in support of round-trip mapping between Unicode and an existing code set. For example, ISO 8859-1 includes the character “Ö”, and Unicode includes “O” and “¨” plus “Ö” as a compatibility character even though “Ö” could be remapped to the two-character sequence without loss of information. Han Unification: The Downside The downside of the unification process is that a “Unicode” font would use the same glyph for the same character number, not matter what glyph variations existed for the “un-unified” forms of the character. At best, this leads to homogenised fonts that use generic representations of the characters and please no-one, and at worst this leads to the occasional “wrong” character when the font uses the wrong glyph variant for a user' locale. In principle, unifying the CJK ideographs is no different than, say, unifying the French “a” and the English “a” or encoding a single abstract “a” to represent all of the multiple “a” characters you saw in Figure 1. In practice, however, the different scripts have each had centuries in which to develop local variations on the abstract character, so perhaps the CJK unification is closer to unifying blackletter and sanserif forms of the same character. To illustrate this point, Figure 5 shows “knoll” written with three variants of the Lucida typeface. They all have a common root in the Lucida design, but if you are used to reading one variant, the fact that they have a common root won't make you instantly comfortable with the other forms. knoll knoll knoll Figure 5. “knoll” written with three variants of the Lucida typeface Table 2 shows how three ideographs rendered, where possible, in a Simplified Chinese, Traditional Chinese, Japanese and Korean font. Note how the characters are not all identical across scripts but nor are they all radically different. Table 2 Sample Simplified Chinese, Traditional Chinese, Japanese, and Korean glyphs Unicode Definition character number U+5E73 U+72AC U+7DF4 Simplified Chinese (MS Song) glyph Traditional Chinese (MingLiU) glyph Japanese (MS Korean Mincho) (GulimChe) glyph glyph Flat, level, even, peaceful Dog Practice, drill, – exercise, train It is universally acknowledged that extra information can and should be added to the document, using the proposed language tags or a “higher-level protocol” such as XML XML/SGML Asia Pacific 99 5 Unicode: What is it and how do I use it? • Tony Graham markup and the xml:lang attribute, to specify the correct language and/or locale for the text, and this can be used to key the selection of fonts for representing the characters. This does, however, lead to more complex software and detracts from the “Characters, not Glyphs” design principle. A benefit of the CJK ideograph unification is that the Unicode Standard incorporates characters for each of the characters in its primary sources, so a Unicode encoding can be used as a “hub” when transcoding from one source standard into another. Dynamic Composition We have already seen “O” and “¨” as an example of a base plus a combining character. Instead of allowing just “well-known” accented characters such “Ö”, Unicode allows dynamic composition of accented forms where any base character plus any combining character (or sequence of combining characters) can make an accented form. Equivalent Sequence Since some character may be represented in precomposed or dynamically composed forms, the Unicode Standard defines equivalent sequences for each precomposed form. Since sequences of combining characters can follow a base character, the Standard also defines a canonical ordering for combining characters. Note that the Unicode Standard does not prescribe one particular internal representation of composed characters or one particular sequence of combining characters. Systems may choose to normalise Unicode text to one particular representation, and the W3C work on “Requirements for String Identity Matching and String Indexing” may standardise that for the entire Web, but in Unicode all sequences of characters are permitted. Convertibility Round-trip conversion between Unicode and many pre-existing standards is possible since each character has a unique correspondence with a sequence of one or more Unicode characters. When a base standard includes multiple variant forms of a single character, the variants are not unified so there will always be a mapping between Unicode and the base standard. While accurate convertibility is guaranteed, many conversions require a mapping table since the corresponding Unicode characters may not be in the same sequence as in the base standard or a base standard character may map to a sequence of Unicode characters. Guaranteeing convertibility has required compromises such as the inclusion of many compatibility characters, but this does mean that Unicode can become a replacement or alternative for these pre-existing standards. It also means that an application can read or write using a pre-existing coded character set but be “Unicode inside”. Unicode and ISO/IEC 10646 The Unicode Standard, Version 2.0, is code-for-code compatible with ISO/IEC 106461:1993, Information Technology – Universal Multiple-Octet Coded Character Set (UCS) – Part 1: Architecture and Basic Multilingual Plane (plus its first seven amendments). ISO/IEC 10646 is a four-octet (32-bit) coded character set, although it currently only specifies characters in the Basic Multilingual Plane (BMP), and these can be represented with 6 XML/SGML Asia Pacific 99 Tony Graham • Unicode: What is it and how do I use it? 16-bit characters. The BMP includes characters in general use in alphabetic, syllabic, and ideographic scripts together with various symbols and digits. ISO/IEC 10646 specifies the names and coded representation of these graphic characters. In addition, two coded representations are specified: UCS-4, the four-octet (32-bit) form of the UCS, and UCS-2, the two-octet (16-bit) BMP form of the UCS. In ISO/IEC 10646 terms, the BMP is “Plane 00 of Group 00”. Characters outside the BMP will require more than 16 bits for their coded representations. The ISO/IEC Working Group and the Unicode Technical Committee have not begun the formal process of creating planes outside the BMP. There are, however, many proposals for assignments in Plane 01, plus Plane 02 is reserved for more CJK ideographs (with possible overflow into Plane 03), and Planes 15 and 16 are reserved for private use. It is expected that, even with two planes reserved for private use, that it will never be necessary expand beyond the 17 planes addressable with UTF-16. Timeline The Unicode Standard and ISO/IEC 10646 began as independent efforts. They merged in 1992, and today the Unicode Technical Committee and the ISO/IEC Working Group operate in parallel. Table 3 Unicode and ISO/IEC 10646 timeline 1989 1990 1990 1991 1992 1992 1993 1993 1995 1996 1998 2000 DP 10646 published, independent of Unicode Unicode 1.0 published DIS-1 10646, independent of Unicode Unicode and ISO 10646 agree to merge Unicode 1.0.1, modified for merger DIS-2 10646, merged with Unicode IS 10646-1:1993, merged standard Unicode 1.1, revised to match 10646-1:1993 10646 amendments Unicode 2.0, revised to cover amendments Unicode 2.1, corrected errors, added Euro Unicode 3.0, ISO/IEC 10646-1:2000, possibly ISO/IEC 10636-2:2000 Unicode's Extra Semantics To ISO/IEC 10646, the Unicode Standard is a “profile” that implements certain portions of ISO/IEC 10646. In particular, Unicode supports only the first 16 planes, and counts as “implementation level 3” since it includes both combining marks and precomposed characters. To Unicode, ISO/IEC 10646 is a superset of the Unicode Standard since it can include more characters than can Unicode. However, the Unicode Standard defines additional character semantics beyond those defined in ISO/IEC 10646, which the Unicode Standard stresses as the major difference between the two standards. In ISO/IEC 10646, the character name is the main resource for character semantics and crossmapping between standards. The character names are identical in the Unicode Standard, but XML/SGML Asia Pacific 99 7 Unicode: What is it and how do I use it? • Tony Graham that Standard adds or includes alias names, usage annotations, character properties, conformance specifications, and tables of explicit mappings between the Unicode Standard and pre-existing standards. What in the World is Included? Table 4 lists the character blocks in the Unicode Standard, Version 3.0. Note that some block names are different in ISO/IEC 10646-1:1993. Table 4 Start End 0000 007F 0080 00FF 0100 017F 0180 024F 0250 02AF 02B0 02FF 0300 036F 0370 0400 03FF 04FF 0530 0590 0600 058F 05FF 06FF 0700 074F 0780 0900 0980 07BF 097F 09FF 0A00 0A7F 0A80 0AFF 0B00 0B7F 0B80 0BFF 0C00 0C7F 8 Character blocks in the Unicode Standard, Version 3.0 Name Basic Latin. This is the same as ASCII (ISO 646-IRV). Latin-1 Supplement. This is the same as the corresponding character numbers in ISO 8859-1. Latin Extended-A. Additional letters that extend the Basic Latin and Latin-1 Supplement blocks to support additional languages such as Afrikaans, Polish, and Welsh. Latin Extended-B. Additional letters required supporting additional languages such as Croatian. IPA Extensions. Symbols of the International Phonetic Alphabet (IPA) that are not encoded elsewhere. Spacing Modifier Letters. Signs indicating modification of a preceding letter principally used in phonetics. Combining Diacritical Marks. Diacritical marks for use with any script. Remember that in the Unicode Standard, combining characters can be used with any base character. Greek. Modern Greek, archaic Greek, and Coptic letters. Cyrillic. Letters for Slavic languages, including some historic Cyrillic letters rarely used in modern forms. Armenian. Used primarily for writing the Armenian language. Hebrew. Used for Hebrew, Yiddish, Judezmo, and other languages. Arabic. Used for Arabic, and also for other languages such as Persian and Urdu. Syriac. Language that grew out of Aramaic and now mainly used as the liturgical language for certain Christian churches. Thaana. Script for the Dhivehi language used in the Maldives. Devanagari. Used for classical Sanskrit and its modern derivatives. Bengali. A North Indian script derived from Sanskrit and used for Bengali plus other languages such as Assamese. Gurmukhi. Used to write the Punjabi language. Gujarati. Used to write Gujarati and Kacchi. Oriya. Another North Indian script related to Devangari that is used for the Oriya language and also for Khondi and Santali. Tamil. Used for the Tamil language, which is used in southern India, Sri Lanka, Singapore, and parts of Malaysia. Telugu. Used for the Telugu language of Andhra Pradesh state in India and also for minority languages such as Gondi and Lambadi. XML/SGML Asia Pacific 99 Tony Graham • Unicode: What is it and how do I use it? Start End Name 0C80 0CFF Kannada. Used for the Kannada language of Karnataka state in India and also for minority languages such as Tulu. 0D00 0D7F Malayalam. Used for the Malayalam language of Kerala state in India. 0D80 0DFF Sinhala. Used for Sinhalese, the official language of Sri Lanka. 0E00 0E7F Thai. Used for the Thai language and other languages such as Kuy, Lavna, and Pali. 0E80 0EFF Lao. Used for Lao, the official language of Laos. 0F00 0FFF Tibetan. Used for Tibetan and other languages such as Ladakhi and Lahuli. 1000 109F Myanmar. Used for Burmese and other languages spoken in Myanmar such as Mon, Karen, and Shon. 10A0 10FF Georgian. Used for the language of Georgia on the southeastern corner of the Black Sea. 1100 11FF Hangul Jamo. Alphabetic components that are combined, or conjoined, to form the syllables of the Hangul script for the Korean language. 1200 137F Ethiopic. Used for Amharic and other languages of Ethiopia 13A0 13FF Cherokee. Syllabary of the Cherokee language. 1400 167F Unified Canadian Aboriginal Syllabics. Syllabic characters for languages such as Cree, Ojibwe, Athabaskan, and Inuit. 1680 169F Ogham. Archaic Irish script. 16A0 16FF Runic. Archaic Norse script. 1780 17FF Khmer. Script for the national language of Cambodia. 1800 18AF Mongolian. Script for the Mongolian language. 1E00 1EFF Latin Extended Additional. Precomposed combinations of Latin letters included in the Unicode Standard for compatibility with the developing ISO 10646 standard. 1F00 1FFF Greek Extended. Precomposed combinations of Greek letters included in the Unicode Standard for compatibility with the developing ISO 10646 standard. 2000 206F General Punctuation. Punctuation characters, some of which are used by many different scripts. 2070 209F Superscripts and Subscripts. Superscript and subscript characters included for compatibility with existing character sets. 20A0 20CF Currency Symbols. Currency symbols not encoded in other blocks. 20D0 20FF Combining Marks for Symbols. Diacritical marks that are generally applied to mathematical or technical symbols. 2100 214F Letterlike Symbols. Symbols derived from letters of alphabetic scripts. 2150 218F Number Forms. Number forms, e.g. roman numerals, included for compatibility with existing character sets. 2190 21FF Arrows. 2200 22FF Mathematical Operators. 2300 23FF Miscellaneous Technical. 2400 243F Control Pictures. Representations of the names of the C0 control characters, and the space, delete, and newline characters. 2440 245F Optical Character Recognition. OCR-A characters that are not otherwise encoded and MICR symbols used in check processing. XML/SGML Asia Pacific 99 9 Unicode: What is it and how do I use it? Start End 2460 24FF 2500 257F 2580 259F 25A0 25FF 2600 2700 26FF 27BF 2800 2E80 28FF 2EFF 2F00 2FDF 2FF0 2FFF 3000 303F 3040 309F 30A0 30FF 3100 312F 3130 318F 3190 319F 31A0 31BF 3200 32FF 3300 33FF 3400 4DB5 4E00 9FFF A000 A48F A490 AC00 D800 DB80 10 A4CF D7A3 DB7F DBFF • Tony Graham Name Enclosed Alphanumerics. Enclosed numbers and letters included for compatibility with existing standards. Box Drawing. Box drawing characters included for compatibility with existing standards. Block Elements. Graphic characters representing various filled or stated blocks included for compatibility with existing standards. Geometric Shapes. Graphic characters representing various geometric shapes. Miscellaneous Symbols. Dingbats. Characters from the well-known ITC Zapf Dingbats font that are not otherwise encoded. Braille Patterns. Braille pattern symbols. CJK Radicals Supplement. Additional radicals not included in the 214 Kangxi radicals. Kangxi Radicals. 214 characters for the fundamental components Han ideographs. Ideographic Description Characters. Characters used for describing a one layout of ideographic character when the correct character is not available. CJK Symbols and Punctuation. Punctuation marks and symbols used in China, Japan, and Korea. Hiragana. The hiragana syllabary used to write Japanese. Katakana. The katakana syllabary used to write loan-words in Japanese. Bopomofo. Phonetic characters used for Chinese, principally the Mandarin language. Hangul Compatibility Jamo. Additional jamo included for compatibility with KSC 5601. Kanbun. Literally "Chinese writing by/for Japanese", these are marks used in Japanese text to indicate the Japanese reading order of classical Chinese text. Bopomofo Extended. Additional Bopomofo characters. Enclosed CJK Letters and Months. Characters with enclosing circles, boxes, or parentheses included for compatibility with existing standards. CJK Compatibility. Characters included for compatibility with existing standards. CJK Unified Ideographs Extension A. Additional ideographs added with Version 3.0. CJK Unified Ideographs. Han ideographs unified from Chinese, Japanese, and Korean sources. Yi Syllables. Yi, also known as Lolo, is a script resembling Chinese in overall shape that is used in the Yunnan province of China. Yi Radicals. Basic units of the Yi syllables. Hangul Syllables. Composed syllables for the Korean writing system. High Surrogates. The first code value in a surrogate pair. High Private Use Surrogates. The first code value in a surrogate pair that addresses the private use areas that are accessible with UTF-16. XML/SGML Asia Pacific 99 Tony Graham • Unicode: What is it and how do I use it? Start End Name DC00 DFFF Low Surrogates. The second code value in a surrogate pair. E000 F8FF Private Use. Reserved for characters defined by private agreement between users. F900 FAFF CJK Compatibility Ideographs. Ideographs included for compatibility with existing standards. FB00 FB4F Alphabetic Presentation Forms. Presentation forms of Armenian, Hebrew, and Latin characters included for compatibility with existing standards. FB50 FDFF Arabic Presentation Forms-A. Presentation forms—for example, contextual variants, ligatures, and ornate forms—of Arabic characters included for compatibility with existing standards. FE20 FE2F Combining Half Marks. Pairs of characters that are each the presentation form of one half of a combining mark that applies to multiple base characters. FE30 FE4F CJK Compatibility Forms. Presentation forms of ideographs that are already encoded by the standard. FE50 FE6F Small Form Variants. Small variants of ASCII punctuation characters included for compatibility with CNS 11643. FE70 FEFE Arabic Presentation Forms-B. Additional Arabic presentation forms. FEFF FEFF Specials. The ZERO WIDTH NO-BREAK SPACE character, which also has significance as an encoding form signature. FF00 FFEF Halfwidth and Fullwidth Forms. Presentation forms included for round-trip mapping with existing CJK character sets. FFF0 FFFD Specials. Characters with special significance for Unicode processing. What's in the Works? Table 5 lists planned or proposed scripts from around the world (and out of this world). Table 5 Scripts proposed for inclusion in the Unicode Standard Name Avestan Basic Egyptian Hieroglyphics Blisssymbolics Brahmi Buginese Byzantine Musical Symbols Cham Cirth Coptic Cypriot Description An ancient Iranian language that was the sacred language of the Zoroastrians. Used by scholars to write Ancient Egyptian. A symbol system developed by Charles Bliss as a visual supplement to speech. Script used in ancient India. Language spoken on the island of Celebes in Indonesia. Symbols used in the Byzantine musical tradition One of the minor languages spoken in Vietnam. Script invented by J. R. R. Tolkien and used in Lord of the Rings and The Silmarillion. Liturgical language of the Coptic Church. Syllabary for the Cypriot dialect of Greek that was used from about 800 XML/SGML Asia Pacific 99 11 Unicode: What is it and how do I use it? Name Syllabary Deseret Alphabet Etruscan Glagiotic Gothic Javanese Linear B Meroitic Old Hungarian Runic Old Permic Old Persian Cuneiform Phillipine Scripts Phoenician Pollard • Tony Graham Description BC to 200 BC. Phonetic alphabet for English used by The Church of Jesus Christ of Latter-day Saints. Created in the 19th century, it has never been widely used. Script for the Etruscan and Oscan languages. The script was used from the 7th century BC to the first century AD. Alphabet ascribed to St Cyril that was formerly used for some Slavic alphabets. Phonetic alphabet used by the Germanic tribe of the Goths that was invented in the fourth century by the Gothic bishop, Wulfila. Language spoken in central and eastern Java. No connection with the computer language. Oldest known Greek syllabic system that was used on Crete until at least 1375 BC. Script used by the ancient Meroites in what is now Sudan. Runes A language from northeastern Russia Used by the scholarly community to write Old Persian. Scripts for writing Tagalog, Hanunóo, Buhid, and Tagbanwa. Script of ancient Phoenicia. Script invented by Samuel Pollard and used for about a dozen languages in Southeast Asia. Rong Script based on Tibetan writing that was devised in 1720. Shavian Phonetic alphabet designed by Kingsley Read that won a competition for a new alphabet established under the terms of George Bernard Shaw's will. Also known as “Shaw's alphabet” and the “Proposed British Alphabet”. Sinaitic An ancient Semitic script South Arabian Ancient script from southern Arabia. Soyombo Alphabetic script invented in 1686 for writing Mongolian, Tibetan, and Sanskrit. Tai (Dai) scripts A family of languages from Southeast Asia. Thai and Lao are both offshoots of Tai. Tengwar Script invented by J. R. R. Tolkien and used in Lord of the Rings and The Silmarillion. Tifinagh Writing system consisting only of consonants that is used by the Tuareg (Berber) tribesmen. tlhingan Hol Klingon script. Ugaritic Used by the scholarly community to write Ugaritic. Cuneiform Western Musical Symbols from the Western musical tradition. Symbols 12 XML/SGML Asia Pacific 99 Tony Graham • Unicode: What is it and how do I use it? Is it a Universal Character Set? As we have seen, the Unicode Standard and ISO/IEC 10646 do support a multitude of scripts, but their estimated success or failure in representing the major scripts of the world varies depending on whose opinion you ask. To many in Asia, the answer is a resounding “No”, largely because of dissatisfaction with the Han unification process that merged Chinese, Japanese, and Korean ideographs. Unicode supports many base standards as they existed in 1993 (and Version 3.0 will align the Unicode Standard with newer versions of some of those standards), but the field of language is not static, and new characters are continually being created. According to [Huang and Huang 89], the Chinese are inventing many new characters each year, and it remains to be seen how well new characters are added to the repertoire. In addition, Unicode does not yet support some modern scripts, although some unsupported scripts are for languages that are commonly written with other, supported scripts. While Unicode also doesn't support many archaic and obsolete scripts, several of these are proposed for future inclusion. Who Uses Unicode or ISO/IEC 10646? Both the list of software that supports Unicode and the list of applications that require Unicode support are continually growing. On the software side, this includes: • Java • MacOS • Windows NT • Windows 95 (partial) • AIX • Plan 9 • NetWare 4.0 • QuickDraw GX • Jade • nsgmls and some other SGML parsers • XML applications • Tcl/Tk • Perl • ECMAScript/Javascript/JScript • Visual Basic and VBScript Applications that require Unicode support include: • XML • HTML 4.0 • Java XML/SGML Asia Pacific 99 13 Unicode: What is it and how do I use it? • Tony Graham How do I specify “@´" The character set may be standardised, but a character's representation in programs and data VWLOOYDULHVZLWKWKHDSSOLFDWLRQ7DEOHVKRZVVRPHRIWKHZD\VWKDWWKH³@´FKDUDFWHULV represented. Table 6 @ +RZ 'R , 6SHFLI\ ³ ´" Standard or Application Representation Comment Unicode U+0416 An individual Unicode value is expressed as U+nnnn, where nnnn is the character's number expressed in hexadecimal notation. CYRILLIC CAPITAL LETTER ZHE All Unicode characters have unique names, which are the same as those in the English language version of ISO/IEC 10646. Names contain only the uppercase letters A to Z, space, hyphen-minus, and occasionally digits. As this table shows, this makes it easy to generate computer-sensible identifiers for individual characters. 0000 0416 UCS-4 0416 UCS-2 CYRILLIC CAPITAL LETTER ZHE ISO/IEC 10646 assigns a unique name to each character. Java \u0416 Java defines a special Unicode escape sequence for representing the hexadecimal value of a Unicode character. Pre-TC SGML Ж Prior to the WebSGML Adaptations TC, SGML allowed only decimal numeric character references. XML and WebSGML Adaptations SGML Ж XML, and SGML after the WebSGML Adaptations TC, allows both decimal and hexadecimal numeric character references. The numeric references do not contain a “u” or otherwise signal that “this is Unicode” since they are references to character numbers in the document character set and, in principle, any single-byte or multi-byte coded character set could be used in an SGML application. ISO/IEC 10646 Ж SGML Declaration 1046 Decimal character numbers only may be used. Ж Decimal numeric character references may be used in minimum literals DSSSL #\cyrillic-capital-letter-zhe; Characters may be referenced by an identifier derived from the character name. #\U-0416 Jade also supports of the form “U-” plus the hexadecimal character number. How do I use Unicode with SGML? After the fundamental step of selecting SGML software that supports Unicode, you can then reference ISO/IEC 10646 as the BASESET in the CHARSET declaration in your SGML 14 XML/SGML Asia Pacific 99 Tony Graham • Unicode: What is it and how do I use it? Declaration. The following CHARSET declaration is taken from the SGML Declaration for XML: CHARSET BASESET "ISO Registration Number 176//CHARSET ISO/IEC 10646-1:1993 UCS-2 with implementation level 3//ESC 2/5 2/15 4/5" DESCSET 0 9 UNUSED 9 2 9 11 2 UNUSED 13 1 13 14 18 UNUSED 32 95 32 127 1 UNUSED 128 32 UNUSED 160 65376 160 You should also refer to ISO/IEC 10646 as your syntax reference character set, and, additionally, allow non-ASCII characters in names. How do I use Unicode with XML? All XML processors are required to support both the UTF-8 and UTF-16 encodings of Unicode characters. XML processors may also support other encodings, and the XML Recommendation lists some of them, but an XML document or parsed entity that is not in either UTF-8 or UTF-16 must begin with an XML Declaration or text declaration that specifies the encoding, for example, <?xml encoding="EUC-JP"?>. To distinguish between the two encodings that do not need an encoding declaration, parsed entities in the UTF-16 encoding must begin with the Byte-Order Mark (BOM). The BOM is used in Unicode systems as a signature for detecting the sequence of bytes in each 16-bit character. When the first two bytes in the parsed entity are FE16 and FF16, then the bytes are in the correct order for the processor, and when the first two bytes are FF16 and FE16, the byte pairs need to be transposed before interpretation as Unicode characters. The BOM is unnecessary for UTF-8, the 8-bit encoding of Unicode characters, so its presence is used in XML systems to indicate the UTF-16 encoding. XML parsed entities may use any legal graphic character specified in the Unicode Standard and ISO/IEC 10646. Any Unicode character that is not allowed in the current encoding may be included by a numeric reference to that character's number in the ISO/IEC 10646 repertoire. Once characters outside the Basic Multilingual Plane become part of Unicode and ISO/IEC 10646, you will be able to represent them in both UTF-8 and UTF-16. In UTF-16, you reference the characters using a two-character sequence called a “Surrogate Pair”. If you can't directly encode the characters outside the BMP, you can make a numeric character reference to their character number. You cannot reference characters by pairs of numeric references to the corresponding Unicode Surrogate Pair code values. As stated previously, one of Unicode's design principles is “Equivalent Sequence” between precomposed characters and their canonical or compatibility decompositions. By default, however, XML does not consider that precomposed characters and their compatibility decompositions are equal in string comparisons. The XML recommendation does state that “at user option, processors may normalise such characters to some canonical form”, but a quick survey doesn't show any XML parsers implementing this user option. XML/SGML Asia Pacific 99 15 Unicode: What is it and how do I use it? • Tony Graham For example, according to the Unicode standard, LATIN SMALL LETTER O + COMBINING DIAERESIS (o + ¨) and LATIN SMALL LETTER O WITH DIAERISIS (ö) are canonical equivalents, but if they were used as the name in an element's start-tag and endtag, respectively, this would cause an error because they are not equivalent according to the XML recommendation (Figure 6). Figure 6. Canonical equivalents are not equivalent to XML This is more likely to happen, however, if the two occurrences of the name or string were in two different entities, such as a document and its DTD, created using different editors that output different, but canonically-equivalent under Unicode, character sequences for the same text. It remains to be seen whether the forthcoming W3C work on string identity matching obviates this difference from Unicode before it becomes a problem for XML. It may help you with XML data sent over the Web, but it probably won't save you from causing yourself problems with different software on your own computer. Can I use ISO Entities with Unicode? Annex D of ISO 8879 declared several sets of general entities, and these have been in wide use over the last thirteen years. ISO TR 9573 superseded Annex D and declared more sets of general entities. Despite this longevity and comparatively widespread use, Unicode and ISO/IEC 10646 do not include characters for all of the ISO entities. If you want to stay with the official character repertoire, XML-compatible versions prepared by Rick Jelliffe of several ISO entity sets are available at http://www.altheim.com/specs/charents.html. The W3C Mathematical Markup Language (MathML) effort needed all of the ISO TR 9573 entities and more besides, so the MathML recommendation [MathML 99] defines characters for the additional entities and, for now, standardises character numbers for these characters in the user-defined area of Unicode and ISO/IEC 10646. The recommendation states that the “STIX project of the STIPUB group of scientific and technical publishers” is working both on making available a complete set of fonts for Unicode characters for science and technology and on proposing inclusion of these characters in Unicode and ISO/IEC 10646. The STIPUB Consortium, working through the American Mathematical Society, submitted a proposal for additional math symbols and technical symbols to the ISO/IEC working group in March 1998 [WGReg 98], but additional math symbols and technical symbols are not currently listed among Unicode's proposed scripts [Prop 99]. Details of the characters, including lists organised by entity name and by Unicode character number, are available at [MathML 99]. The entities are current, revised version of the MathML DTD [MathML DTD]. Since the character numbers for these additional characters are in the private use area, it is best to take the advice of the MathML recommendation and 16 XML/SGML Asia Pacific 99 Tony Graham • Unicode: What is it and how do I use it? refer to these characters by entity name only, as the character numbers will change if the characters are officially included in Unicode and ISO/IEC 10646. What are the Unicode and ISO/IEC 10646 Coded Representations? The Unicode Standard and ISO/IEC 10646 share a bewildering variety of coded representations for the same characters. The following sections will help in deciphering the acronyms. UCS-4 – Four-Octet (32-bit) Canonical Form of the UCS UCS-4 is the four-byte representation of characters in ISO/IEC 10646. It is not used in the Unicode Standard. It can represent more characters than the Unicode Standard intends to ever represent. UCS-2 – Two-Octet (16-bit) BMP Form of the UCS UCS-2 is the two-byte representation of characters defined in ISO/IEC 10646, but the Unicode Standard, Version 2.0, is compatible with UCS-2. The Unicode Standard, Version 3.0, will outgrow UCS-2 when it declares planes 15 and 16 as available for private use. UTF-7 – UCS Transformation Format, 7-Bit Form UTF-7 is defined in IETF RFC 1642, and is summarised in an Appendix to the Unicode Standard 2.0, but it is not part of ISO/IEC 10646. UTF-7 is a 7-bit form that is safe for use with email, Simple Mail Transport Protocol (SMTP), and Multimedia Internet Mail Extensions (MIME). Each 16-bit Unicode character is encoded as between 1 and 2] 7-bit characters. UTF-7 is sometimes considered to be obsolete, but there are still 7-bit mail systems in use around the world. UTF-8 – UCS Transformation Format, 8-Bit Form UTF-8 is common to both the Unicode Standard and ISO/IEC 10646, and all XML processors are required to support this format. Originally developed by X/Open to enable use of Unicode character data in 8-bit Unix environments, this transformation format was known as File System Safe UTF (FSS-UTF) and as UTF-2 before being renamed UTF-8 and included as a normative addendum to ISO/IEC 10646. UTF-8 encodes each Unicode character as one, two, or three bytes (and UCS-4 characters as up to six bytes). All of the ASCII code values are represented as a single byte, most nonideographic characters are represented as two-byte sequences, and the remaining Unicode characters are represented as three-byte sequences. Once characters outside the BMP are defined, the Surrogate Pair codes that Unicode uses to represent these characters will be represented as four-byte sequences UTF-8 is a simple and efficient conversion and a reasonably compact encoding, and it is compatible with ASCII since the ASCII characters map to the same coded representation in UTF-8. However, many Japanese, for example, do not like UTF-8 since their files become bigger when the current double-byte characters are converted to three bytes of UTF-8. XML/SGML Asia Pacific 99 17 Unicode: What is it and how do I use it? • Tony Graham UTF-16 – UCS Transformation Format for Planes of Group 00 UTF-16 is an ISO/IEC 10646 encoding that is equivalent to the Unicode Standard with the use of surrogates. The only difference between UCS-2 and UTF-16 is the use of surrogates with UTF-16. However, while XML specifies support for UTF-16, and UTF-16 includes the code points for surrogate pair characters, the character numbers for the code points in the surrogate blocks are not legal XML character numbers. Surrogate pairs in a UTF-16 document will be refer to a non-UCS-2 character, but numeric character references to a character number in the surrogate blocks are not legal XML. Use of code points in the surrogate blocks, or the S (Special) Zone of the BMP as it is called in ISO/IEC 10646, enables addressing of characters in planes 1…1610 of group 0 of UCS-4; that is, it allows addressing of over one million additional UCS-4 code values in the range 00010000…0010FFFF16. The Unicode Consortium now refers to UTF-16, UTF-16BE, and UTF-16LE. UTF-16 is UTF-16 with the Byte-Order Mark (BOM) to indicate the order of the two bytes of a UTF-16 character. UTF-16BE is UTF-16 without the BOM and with the most significant byte first. It is the “native” UTF-16 format for big-endian processors. UTF-16LE is UTF-16 without the BOM and with the least significant byte first. It is the “native” format for little-endian processors such as Pentiums. UTF-32 This is a new 32-bit format published by the Unicode Consortium. UTF-32 characters are identical to UCS-4 characters over the UTF-32 character range, but, by design, UTF-32 only addresses the characters in planes 00 to 16 that you can represent using UTF-16. In addition, UTF-32 characters are specified as having the full Unicode character semantics, which you cannot guarantee for text labelled as UCS-4. Which is best for me? There is no single, simple answer to this question. The choice of encoding will depend in part upon your language and in part upon the tools that you are using. For example, if you are working in English, it is simplest to use UTF-8 since UTF-8 is a superset of ASCII, but if you are working in Japanese, it would be preferable to use UTF-16, since using UTF-8 would result in larger files. However, if you working with Perl, the latest version of which handles Unicode characters as UTF-8 internally, it might be simplest to only use UTF-8 for input and output. Your choice may be determined by your choice of application, since even on “little-endian” Intel processors, some applications save Unicode text as UTF-16BE. The good news is that recent versions of web browsers can handle both UTF-8 and UTF-16 text, as can your XML software. Further Information Definitive information on the Unicode Standard is available from the Unicode Consortium at http://www.unicode.org. Information on ISO/IEC 10646 is available at http://www.dkuug.dk/jtc1/sc2/wg2/. 18 XML/SGML Asia Pacific 99 Tony Graham • Unicode: What is it and how do I use it? Bibliography [Huang and Huang 89] Huang, Jack K. T., and Timothy D. Huang. 1989. An Introduction to Chinese, Japanese, and Korean Computing. Singapore: World Scientific [MathML 99] Mathematical Markup Language, Chapter 6, Entities, Characters, and Fonts, http://www.w3.org/TR/REC-MathML/chapter6.html [MathML DTD] Mathematical Markup Language, Appendix A, The MathML DTD, http://www.w3.org/TR/REC-MathML/appendixA.html [Prop 99] Unicode Consortium, Proposed New Scripts, http://www.unicode.org/pending/pending.html [TR7 98] Unicode Consortium, Technical Report #7, Plane 14 Characters for Language Tags, http://www.unicode.org/unicode/reports/tr7.html [WGReg 98] ISO/IEC JTC 1/SC 2/WG 2, N1750, Partial Document Register (N1600 N1810) + standing documents, http://wwwold.dkuug.dk/JTC1/SC2/WG2/docs/documents Biography Tony Graham Senior Consultant Mulberry Technologies, Inc. 17 West Jefferson Street, Suite 207 Rockville MD 20850 U.S.A. Phone: 301/315-9631 E-mail: [email protected] Web: http://www.mulberrytech.com Tony Graham is a Specialist Member of the Unicode Consortium and author of the forthcoming Unicode: A Primer. He has been keenly interested in character set issues since 1991 when he joined Uniscope, Inc. in Tokyo, Japan, where he developed SGML applications in Chinese, English, Japanese, and Korean. He is currently a Senior Consultant with Mulberry Technologies, Inc., an SGML and XML Consultancy specialising in training and design. Tony has spoken on Unicode to conference and user group audiences, and has written on Unicode for MultiLingual Computing & Technology. XML/SGML Asia Pacific 99 19
© Copyright 2026 Paperzz