Unicode: What is it and how do I use it?

Unicode: What is it and how do I use it?
The rationale for Unicode and its design goals and detailed design principles are presented. The
correspondence between Unicode and ISO/IEC 10646 is discussed, the scripts included or planned for inclusion
in the two character set standards are listed. Some products that support Unicode and some applications that
require Unicode are listed, then examples of how to specify Unicode characters in a variety of applications are
given. Use of Unicode in SGML and XML applications is discussed, and the paper concludes with descriptions
of the character encodings used with Unicode and ISO/IEC 10646, plus sources of further information are
listed.
Tony Graham, Senior Consultant, Mulberry Technologies, Inc.
What is Unicode?
Unicode is a character encoding standard published by, of all people, the Unicode
Consortium. Unicode is designed to include all of the major scripts of the world in a simple
and consistent manner. The Unicode Standard also defines properties of the characters and
algorithms for use in Unicode implementations.
Existing character encodings are a mixture of one-byte and two-byte encodings—Chinese,
Japanese, and Korean encodings need two bytes per character, and most other coded
character sets use only one byte per character. Most coded character sets are more or less
compatible with ASCII for the first 127 character numbers, but above that it's anybody's
game. Changes between character encodings are signaled by escape sequences embedded in
the text stream, so to determine the encoding, and in some cases to determine how many
bytes to read as a single character, you need to process the entire text stream up to that point.
The net result is that one transmission error or one misstep in applying encoding-translation
software can garble an entire document.
Unicode was born from frustration by software manufacturers with the fragmented,
complicated, and contradictory character encodings in use around the world. The technical
difficulties of dealing with different coded character sets meant that software had to be
extensively localised before being released into different markets, which meant that the
“other language” versions of software could be significantly delayed and also be significantly
different internally because of the character handling requirements. Not surprisingly, the
same character handling changes were regrafted onto each new version of the base software
before it could be released into other markets. Of course the character encoding is not the
only aspect to be changed when localizing software, but it is a large component, and it is the
aspect most likely to have a custom solution for each new language.
The Unicode work began in the late 1980s, and the Unicode Consortium was incorporated as
Unicode, Inc. in 1991. Today the Consortium's members include many major software and
hardware manufacturers. The current version of the Unicode Standard is version 2.1.9, and
version 3.0 will be published in January 2000.
Version 2.1 was issued as a Technical Report that lists additions and corrections from version
2.0 of the Standard. That document is available as a 950-page book that defines Unicode and
lists every character with an example glyph. As an example of the problems of handling the
world's scripts, in order to display every glyph, the version 2.0 book required specialised
formatting software from five different companies.
XML/SGML Asia Pacific 99
1
Unicode: What is it and how do I use it?
• Tony Graham
Version 3.0 is set to be an even larger book, since it defines over 10,000 more characters than
Version 2.1, plus the new version promises a revised structure and more details on
implementing support for the Unicode Standard. Table 1 summarizes the changes in the
character set between Version 2.1 and Version 3.0.
Table 1
Comparison of version 2.1 and version 3.0 of the Unicode
Standard
Category
Alphabetics, Symbols
CJK Ideographs
Hangul Syllables
Total assigned characters
Private Use
Surrogates
Controls
Not Characters
Total assigned 16-bit code values
Unassigned 16-bit code values
2.1
6511
21204
11172
38887
6400
2048
65
2
47402
18134
3.0
10236
27786
11172
49194
6400
2048
65
2
57709
7827
The Unicode Standard defines a fixed-width 16-bit, uniform encoding scheme for written
characters and text. The Unicode Standard, Version 2.0, is also code-for-code identical with
ISO/IEC 10646-1:1993 (plus the first seven amendments), and Version 3.0 will be code-forcode identical with the forthcoming ISO/IEC 10646-1:2000. The Unicode Standard, however,
does define more semantics for characters than does ISO/IEC 10646. The Unicode Technical
Committee has a liaison membership with the ISO/IEC Working Group responsible for
computer character sets.
Unicode Design Goals
The original design goals of the Unicode Standard were:
Universal
The repertoire had to be large enough to encompass all
characters likely to be used in general text interchange.
Efficient
Plain text, composed of a sequence of fixed width characters, is
simple to parse, and software does not need to maintain state,
look for special escape sequences, or search forward or
backward through text to identify characters.
Uniform
A fixed-length character code allows efficient sorting,
searching, display, and editing of text.
Unambiguous
Any given 16-bit value always represents the same character.
Unicode Design Principles
Building on the design goals, the design of the Unicode Standard reflects the following ten
design principles.
2
XML/SGML Asia Pacific 99
Tony Graham
• Unicode:
What is it and how do I use it?
Sixteen-Bit Characters
All Unicode character codes are a uniform 16 bits. For compatibility with existing computer
systems that can't handle 16-bit characters, the Unicode Standard defines the UTF-8 and
UTF-7 formats for lossless transformation between Unicode characters and 8-bit and 7-bit
character sequences, respectively.
Full Encoding
Using 16-bit characters allows over 65,000 code positions that can be assigned to characters.
This was expected to far exceed the expected requirements for all modern and most archaic
languages. However, as of Unicode 3.0, there are less than 8,000 unassigned code positions,
and Unicode 4.0 will assign characters outside the 65,000 code positions in the “Basic
Multilingual Plane”.
Characters, Not Glyphs
Unicode is concerned with characters, not glyphs. Characters are the smallest component of a
written language, and glyphs are the representation of the written or displayed character.
There is often a one-to-many or many-to-one relationship between characters and glyphs.
Even in English, there are many different glyphs that can represent a single character (Figure
1), fonts that contain many glyphs for the same character (Figure 2), and in high-quality
typesetting, single-glyph ligatures used for sequences of, for example, “f” and “i” or “f” and
“l” characters (Figure 3).
Figure 1.
Will the real “a” please step forward?
Figure 2.
Ampersands
Figure 3.
“ffi” and “ffi” ligature
XML/SGML Asia Pacific 99
3
Unicode: What is it and how do I use it?
• Tony Graham
Semantics
The Standard provides well-defined semantics for each character, including numeric, spacing,
combination, and directionality properties.
The specification of these semantics is not included in ISO/IEC 10646, and the writers of the
Unicode Standard see this as a major feature of Unicode over ISO/IEC 10646.
Plain Text
Plain Unicode is a sequence of character codes. Unicode does not currently define any codes
for specifying language, font, etc., although there are character codes that provide hints about
directionality of the text.
In Unicode terms, SGML and XML are “fancy text” or “higher-level protocols” since their
markup represents additional data structures interspersed in the stream of plain Unicode
characters.
The “plain text” nature of Unicode is set to change since a recent Unicode Technical Report
[TR7 98] proposes the addition of language tags using characters on Plane 14 of ISO/IEC
10646. Since the language tags cannot be confused for textual content, parsing for them is, in
theory, easier, than, say, using SGML or XML, and they are seen as a lighter-weight
mechanism for specifying language.
Logical Order
Characters are stored in their logical order: in the sequence in which they are read, which is
not always the sequence in which they are displayed. For example, Arabic and Hebrew are
read right-to-left. Since the logical start of the right-to-left text is the character closest to the
right margin, that character is the first character in the Unicode character stream, as shown in
Figure 4.
Display order
Example of "tfel ot thgir" text
Logical order
Example of "right to left" text
Figure 4.
Bidirectional Ordering
The characters' directionality properties, and use of character codes specifying changes in
direction when mixing characters of different dominant direction, provide sufficient
information for correct rendering of the text.
In addition, the Unicode Standard specifies that combining marks always follow their base
character. In contrast, existing encodings are not standardised, and some require the
combining characters before the base character and some after.
Unification
Unicode unifies characters within scripts across languages so that characters with equivalent
form are given a single code. For example, common letters, punctuation marks, symbols, and
diacritics were each given one code. In addition, over 120,000 ideographs used in Chinese,
4
XML/SGML Asia Pacific 99
Tony Graham
• Unicode:
What is it and how do I use it?
Japanese, and Korean were unified to 21,204 character codes (expanding to 27,786 in
Version 3.0).
The exceptions are “compatibility characters” that could have been unified but remained as
separate code positions, often in support of round-trip mapping between Unicode and an
existing code set. For example, ISO 8859-1 includes the character “Ö”, and Unicode includes
“O” and “¨” plus “Ö” as a compatibility character even though “Ö” could be remapped to the
two-character sequence without loss of information.
Han Unification: The Downside
The downside of the unification process is that a “Unicode” font would use the same glyph
for the same character number, not matter what glyph variations existed for the “un-unified”
forms of the character. At best, this leads to homogenised fonts that use generic
representations of the characters and please no-one, and at worst this leads to the occasional
“wrong” character when the font uses the wrong glyph variant for a user' locale.
In principle, unifying the CJK ideographs is no different than, say, unifying the French “a”
and the English “a” or encoding a single abstract “a” to represent all of the multiple “a”
characters you saw in Figure 1. In practice, however, the different scripts have each had
centuries in which to develop local variations on the abstract character, so perhaps the CJK
unification is closer to unifying blackletter and sanserif forms of the same character. To
illustrate this point, Figure 5 shows “knoll” written with three variants of the Lucida typeface.
They all have a common root in the Lucida design, but if you are used to reading one variant,
the fact that they have a common root won't make you instantly comfortable with the other
forms.
knoll knoll knoll
Figure 5.
“knoll” written with three variants of the Lucida typeface
Table 2 shows how three ideographs rendered, where possible, in a Simplified Chinese,
Traditional Chinese, Japanese and Korean font. Note how the characters are not all identical
across scripts but nor are they all radically different.
Table 2
Sample Simplified Chinese, Traditional Chinese, Japanese, and
Korean glyphs
Unicode Definition
character
number
U+5E73
U+72AC
U+7DF4
Simplified
Chinese
(MS Song)
glyph
Traditional
Chinese
(MingLiU)
glyph
Japanese (MS Korean
Mincho)
(GulimChe)
glyph
glyph
Flat, level,
even, peaceful
Dog
Practice, drill, –
exercise, train
It is universally acknowledged that extra information can and should be added to the
document, using the proposed language tags or a “higher-level protocol” such as XML
XML/SGML Asia Pacific 99
5
Unicode: What is it and how do I use it?
• Tony Graham
markup and the xml:lang attribute, to specify the correct language and/or locale for the text,
and this can be used to key the selection of fonts for representing the characters. This does,
however, lead to more complex software and detracts from the “Characters, not Glyphs”
design principle.
A benefit of the CJK ideograph unification is that the Unicode Standard incorporates
characters for each of the characters in its primary sources, so a Unicode encoding can be
used as a “hub” when transcoding from one source standard into another.
Dynamic Composition
We have already seen “O” and “¨” as an example of a base plus a combining character.
Instead of allowing just “well-known” accented characters such “Ö”, Unicode allows
dynamic composition of accented forms where any base character plus any combining
character (or sequence of combining characters) can make an accented form.
Equivalent Sequence
Since some character may be represented in precomposed or dynamically composed forms,
the Unicode Standard defines equivalent sequences for each precomposed form. Since
sequences of combining characters can follow a base character, the Standard also defines a
canonical ordering for combining characters.
Note that the Unicode Standard does not prescribe one particular internal representation of
composed characters or one particular sequence of combining characters. Systems may
choose to normalise Unicode text to one particular representation, and the W3C work on
“Requirements for String Identity Matching and String Indexing” may standardise that for the
entire Web, but in Unicode all sequences of characters are permitted.
Convertibility
Round-trip conversion between Unicode and many pre-existing standards is possible since
each character has a unique correspondence with a sequence of one or more Unicode
characters. When a base standard includes multiple variant forms of a single character, the
variants are not unified so there will always be a mapping between Unicode and the base
standard.
While accurate convertibility is guaranteed, many conversions require a mapping table since
the corresponding Unicode characters may not be in the same sequence as in the base
standard or a base standard character may map to a sequence of Unicode characters.
Guaranteeing convertibility has required compromises such as the inclusion of many
compatibility characters, but this does mean that Unicode can become a replacement or
alternative for these pre-existing standards. It also means that an application can read or write
using a pre-existing coded character set but be “Unicode inside”.
Unicode and ISO/IEC 10646
The Unicode Standard, Version 2.0, is code-for-code compatible with ISO/IEC 106461:1993, Information Technology – Universal Multiple-Octet Coded Character Set (UCS) –
Part 1: Architecture and Basic Multilingual Plane (plus its first seven amendments).
ISO/IEC 10646 is a four-octet (32-bit) coded character set, although it currently only
specifies characters in the Basic Multilingual Plane (BMP), and these can be represented with
6
XML/SGML Asia Pacific 99
Tony Graham
• Unicode:
What is it and how do I use it?
16-bit characters. The BMP includes characters in general use in alphabetic, syllabic, and
ideographic scripts together with various symbols and digits. ISO/IEC 10646 specifies the
names and coded representation of these graphic characters. In addition, two coded
representations are specified: UCS-4, the four-octet (32-bit) form of the UCS, and UCS-2, the
two-octet (16-bit) BMP form of the UCS.
In ISO/IEC 10646 terms, the BMP is “Plane 00 of Group 00”. Characters outside the BMP
will require more than 16 bits for their coded representations.
The ISO/IEC Working Group and the Unicode Technical Committee have not begun the
formal process of creating planes outside the BMP. There are, however, many proposals for
assignments in Plane 01, plus Plane 02 is reserved for more CJK ideographs (with possible
overflow into Plane 03), and Planes 15 and 16 are reserved for private use. It is expected that,
even with two planes reserved for private use, that it will never be necessary expand beyond
the 17 planes addressable with UTF-16.
Timeline
The Unicode Standard and ISO/IEC 10646 began as independent efforts. They merged in
1992, and today the Unicode Technical Committee and the ISO/IEC Working Group operate
in parallel.
Table 3
Unicode and ISO/IEC 10646 timeline
1989
1990
1990
1991
1992
1992
1993
1993
1995
1996
1998
2000
DP 10646 published, independent of Unicode
Unicode 1.0 published
DIS-1 10646, independent of Unicode
Unicode and ISO 10646 agree to merge
Unicode 1.0.1, modified for merger
DIS-2 10646, merged with Unicode
IS 10646-1:1993, merged standard
Unicode 1.1, revised to match 10646-1:1993
10646 amendments
Unicode 2.0, revised to cover amendments
Unicode 2.1, corrected errors, added Euro
Unicode 3.0, ISO/IEC 10646-1:2000, possibly ISO/IEC 10636-2:2000
Unicode's Extra Semantics
To ISO/IEC 10646, the Unicode Standard is a “profile” that implements certain portions of
ISO/IEC 10646. In particular, Unicode supports only the first 16 planes, and counts as
“implementation level 3” since it includes both combining marks and precomposed
characters.
To Unicode, ISO/IEC 10646 is a superset of the Unicode Standard since it can include more
characters than can Unicode. However, the Unicode Standard defines additional character
semantics beyond those defined in ISO/IEC 10646, which the Unicode Standard stresses as
the major difference between the two standards.
In ISO/IEC 10646, the character name is the main resource for character semantics and crossmapping between standards. The character names are identical in the Unicode Standard, but
XML/SGML Asia Pacific 99
7
Unicode: What is it and how do I use it?
• Tony Graham
that Standard adds or includes alias names, usage annotations, character properties,
conformance specifications, and tables of explicit mappings between the Unicode Standard
and pre-existing standards.
What in the World is Included?
Table 4 lists the character blocks in the Unicode Standard, Version 3.0. Note that some block
names are different in ISO/IEC 10646-1:1993.
Table 4
Start End
0000 007F
0080 00FF
0100
017F
0180
024F
0250
02AF
02B0 02FF
0300
036F
0370
0400
03FF
04FF
0530
0590
0600
058F
05FF
06FF
0700
074F
0780
0900
0980
07BF
097F
09FF
0A00 0A7F
0A80 0AFF
0B00 0B7F
0B80 0BFF
0C00 0C7F
8
Character blocks in the Unicode Standard, Version 3.0
Name
Basic Latin. This is the same as ASCII (ISO 646-IRV).
Latin-1 Supplement. This is the same as the corresponding character
numbers in ISO 8859-1.
Latin Extended-A. Additional letters that extend the Basic Latin and Latin-1
Supplement blocks to support additional languages such as Afrikaans,
Polish, and Welsh.
Latin Extended-B. Additional letters required supporting additional
languages such as Croatian.
IPA Extensions. Symbols of the International Phonetic Alphabet (IPA) that
are not encoded elsewhere.
Spacing Modifier Letters. Signs indicating modification of a preceding letter
principally used in phonetics.
Combining Diacritical Marks. Diacritical marks for use with any script.
Remember that in the Unicode Standard, combining characters can be used
with any base character.
Greek. Modern Greek, archaic Greek, and Coptic letters.
Cyrillic. Letters for Slavic languages, including some historic Cyrillic letters
rarely used in modern forms.
Armenian. Used primarily for writing the Armenian language.
Hebrew. Used for Hebrew, Yiddish, Judezmo, and other languages.
Arabic. Used for Arabic, and also for other languages such as Persian and
Urdu.
Syriac. Language that grew out of Aramaic and now mainly used as the
liturgical language for certain Christian churches.
Thaana. Script for the Dhivehi language used in the Maldives.
Devanagari. Used for classical Sanskrit and its modern derivatives.
Bengali. A North Indian script derived from Sanskrit and used for Bengali
plus other languages such as Assamese.
Gurmukhi. Used to write the Punjabi language.
Gujarati. Used to write Gujarati and Kacchi.
Oriya. Another North Indian script related to Devangari that is used for the
Oriya language and also for Khondi and Santali.
Tamil. Used for the Tamil language, which is used in southern India, Sri
Lanka, Singapore, and parts of Malaysia.
Telugu. Used for the Telugu language of Andhra Pradesh state in India and
also for minority languages such as Gondi and Lambadi.
XML/SGML Asia Pacific 99
Tony Graham
• Unicode:
What is it and how do I use it?
Start End Name
0C80 0CFF Kannada. Used for the Kannada language of Karnataka state in India and
also for minority languages such as Tulu.
0D00 0D7F Malayalam. Used for the Malayalam language of Kerala state in India.
0D80 0DFF Sinhala. Used for Sinhalese, the official language of Sri Lanka.
0E00 0E7F Thai. Used for the Thai language and other languages such as Kuy, Lavna,
and Pali.
0E80 0EFF Lao. Used for Lao, the official language of Laos.
0F00 0FFF Tibetan. Used for Tibetan and other languages such as Ladakhi and Lahuli.
1000 109F Myanmar. Used for Burmese and other languages spoken in Myanmar such
as Mon, Karen, and Shon.
10A0 10FF Georgian. Used for the language of Georgia on the southeastern corner of
the Black Sea.
1100 11FF Hangul Jamo. Alphabetic components that are combined, or conjoined, to
form the syllables of the Hangul script for the Korean language.
1200 137F Ethiopic. Used for Amharic and other languages of Ethiopia
13A0 13FF Cherokee. Syllabary of the Cherokee language.
1400 167F Unified Canadian Aboriginal Syllabics. Syllabic characters for languages
such as Cree, Ojibwe, Athabaskan, and Inuit.
1680 169F Ogham. Archaic Irish script.
16A0 16FF Runic. Archaic Norse script.
1780 17FF Khmer. Script for the national language of Cambodia.
1800 18AF Mongolian. Script for the Mongolian language.
1E00 1EFF Latin Extended Additional. Precomposed combinations of Latin letters
included in the Unicode Standard for compatibility with the developing ISO
10646 standard.
1F00 1FFF Greek Extended. Precomposed combinations of Greek letters included in the
Unicode Standard for compatibility with the developing ISO 10646
standard.
2000 206F General Punctuation. Punctuation characters, some of which are used by
many different scripts.
2070 209F Superscripts and Subscripts. Superscript and subscript characters included
for compatibility with existing character sets.
20A0 20CF Currency Symbols. Currency symbols not encoded in other blocks.
20D0 20FF Combining Marks for Symbols. Diacritical marks that are generally applied
to mathematical or technical symbols.
2100 214F Letterlike Symbols. Symbols derived from letters of alphabetic scripts.
2150 218F Number Forms. Number forms, e.g. roman numerals, included for
compatibility with existing character sets.
2190 21FF Arrows.
2200 22FF Mathematical Operators.
2300 23FF Miscellaneous Technical.
2400 243F Control Pictures. Representations of the names of the C0 control characters,
and the space, delete, and newline characters.
2440 245F Optical Character Recognition. OCR-A characters that are not otherwise
encoded and MICR symbols used in check processing.
XML/SGML Asia Pacific 99
9
Unicode: What is it and how do I use it?
Start End
2460 24FF
2500
257F
2580
259F
25A0 25FF
2600
2700
26FF
27BF
2800
2E80
28FF
2EFF
2F00
2FDF
2FF0
2FFF
3000
303F
3040 309F
30A0 30FF
3100 312F
3130
318F
3190
319F
31A0 31BF
3200 32FF
3300
33FF
3400
4DB5
4E00
9FFF
A000 A48F
A490
AC00
D800
DB80
10
A4CF
D7A3
DB7F
DBFF
• Tony Graham
Name
Enclosed Alphanumerics. Enclosed numbers and letters included for
compatibility with existing standards.
Box Drawing. Box drawing characters included for compatibility with
existing standards.
Block Elements. Graphic characters representing various filled or stated
blocks included for compatibility with existing standards.
Geometric Shapes. Graphic characters representing various geometric
shapes.
Miscellaneous Symbols.
Dingbats. Characters from the well-known ITC Zapf Dingbats font that are
not otherwise encoded.
Braille Patterns. Braille pattern symbols.
CJK Radicals Supplement. Additional radicals not included in the 214
Kangxi radicals.
Kangxi Radicals. 214 characters for the fundamental components Han
ideographs.
Ideographic Description Characters. Characters used for describing a one
layout of ideographic character when the correct character is not available.
CJK Symbols and Punctuation. Punctuation marks and symbols used in
China, Japan, and Korea.
Hiragana. The hiragana syllabary used to write Japanese.
Katakana. The katakana syllabary used to write loan-words in Japanese.
Bopomofo. Phonetic characters used for Chinese, principally the Mandarin
language.
Hangul Compatibility Jamo. Additional jamo included for compatibility
with KSC 5601.
Kanbun. Literally "Chinese writing by/for Japanese", these are marks used
in Japanese text to indicate the Japanese reading order of classical Chinese
text.
Bopomofo Extended. Additional Bopomofo characters.
Enclosed CJK Letters and Months. Characters with enclosing circles, boxes,
or parentheses included for compatibility with existing standards.
CJK Compatibility. Characters included for compatibility with existing
standards.
CJK Unified Ideographs Extension A. Additional ideographs added with
Version 3.0.
CJK Unified Ideographs. Han ideographs unified from Chinese, Japanese,
and Korean sources.
Yi Syllables. Yi, also known as Lolo, is a script resembling Chinese in
overall shape that is used in the Yunnan province of China.
Yi Radicals. Basic units of the Yi syllables.
Hangul Syllables. Composed syllables for the Korean writing system.
High Surrogates. The first code value in a surrogate pair.
High Private Use Surrogates. The first code value in a surrogate pair that
addresses the private use areas that are accessible with UTF-16.
XML/SGML Asia Pacific 99
Tony Graham
• Unicode:
What is it and how do I use it?
Start End Name
DC00 DFFF Low Surrogates. The second code value in a surrogate pair.
E000 F8FF Private Use. Reserved for characters defined by private agreement between
users.
F900 FAFF CJK Compatibility Ideographs. Ideographs included for compatibility with
existing standards.
FB00 FB4F Alphabetic Presentation Forms. Presentation forms of Armenian, Hebrew,
and Latin characters included for compatibility with existing standards.
FB50 FDFF Arabic Presentation Forms-A. Presentation forms—for example, contextual
variants, ligatures, and ornate forms—of Arabic characters included for
compatibility with existing standards.
FE20 FE2F Combining Half Marks. Pairs of characters that are each the presentation
form of one half of a combining mark that applies to multiple base
characters.
FE30 FE4F CJK Compatibility Forms. Presentation forms of ideographs that are already
encoded by the standard.
FE50 FE6F Small Form Variants. Small variants of ASCII punctuation characters
included for compatibility with CNS 11643.
FE70 FEFE Arabic Presentation Forms-B. Additional Arabic presentation forms.
FEFF FEFF Specials. The ZERO WIDTH NO-BREAK SPACE character, which also
has significance as an encoding form signature.
FF00 FFEF Halfwidth and Fullwidth Forms. Presentation forms included for round-trip
mapping with existing CJK character sets.
FFF0 FFFD Specials. Characters with special significance for Unicode processing.
What's in the Works?
Table 5 lists planned or proposed scripts from around the world (and out of this world).
Table 5
Scripts proposed for inclusion in the Unicode Standard
Name
Avestan
Basic Egyptian
Hieroglyphics
Blisssymbolics
Brahmi
Buginese
Byzantine
Musical
Symbols
Cham
Cirth
Coptic
Cypriot
Description
An ancient Iranian language that was the sacred language of the
Zoroastrians.
Used by scholars to write Ancient Egyptian.
A symbol system developed by Charles Bliss as a visual supplement to
speech.
Script used in ancient India.
Language spoken on the island of Celebes in Indonesia.
Symbols used in the Byzantine musical tradition
One of the minor languages spoken in Vietnam.
Script invented by J. R. R. Tolkien and used in Lord of the Rings and The
Silmarillion.
Liturgical language of the Coptic Church.
Syllabary for the Cypriot dialect of Greek that was used from about 800
XML/SGML Asia Pacific 99
11
Unicode: What is it and how do I use it?
Name
Syllabary
Deseret
Alphabet
Etruscan
Glagiotic
Gothic
Javanese
Linear B
Meroitic
Old Hungarian
Runic
Old Permic
Old Persian
Cuneiform
Phillipine Scripts
Phoenician
Pollard
• Tony Graham
Description
BC to 200 BC.
Phonetic alphabet for English used by The Church of Jesus Christ of
Latter-day Saints. Created in the 19th century, it has never been widely
used.
Script for the Etruscan and Oscan languages. The script was used from
the 7th century BC to the first century AD.
Alphabet ascribed to St Cyril that was formerly used for some Slavic
alphabets.
Phonetic alphabet used by the Germanic tribe of the Goths that was
invented in the fourth century by the Gothic bishop, Wulfila.
Language spoken in central and eastern Java. No connection with the
computer language.
Oldest known Greek syllabic system that was used on Crete until at least
1375 BC.
Script used by the ancient Meroites in what is now Sudan.
Runes
A language from northeastern Russia
Used by the scholarly community to write Old Persian.
Scripts for writing Tagalog, Hanunóo, Buhid, and Tagbanwa.
Script of ancient Phoenicia.
Script invented by Samuel Pollard and used for about a dozen languages
in Southeast Asia.
Rong
Script based on Tibetan writing that was devised in 1720.
Shavian
Phonetic alphabet designed by Kingsley Read that won a competition for
a new alphabet established under the terms of George Bernard Shaw's
will. Also known as “Shaw's alphabet” and the “Proposed British
Alphabet”.
Sinaitic
An ancient Semitic script
South Arabian
Ancient script from southern Arabia.
Soyombo
Alphabetic script invented in 1686 for writing Mongolian, Tibetan, and
Sanskrit.
Tai (Dai) scripts A family of languages from Southeast Asia. Thai and Lao are both
offshoots of Tai.
Tengwar
Script invented by J. R. R. Tolkien and used in Lord of the Rings and The
Silmarillion.
Tifinagh
Writing system consisting only of consonants that is used by the Tuareg
(Berber)
tribesmen.
tlhingan Hol
Klingon script.
Ugaritic
Used by the scholarly community to write Ugaritic.
Cuneiform
Western Musical Symbols from the Western musical tradition.
Symbols
12
XML/SGML Asia Pacific 99
Tony Graham
• Unicode:
What is it and how do I use it?
Is it a Universal Character Set?
As we have seen, the Unicode Standard and ISO/IEC 10646 do support a multitude of scripts,
but their estimated success or failure in representing the major scripts of the world varies
depending on whose opinion you ask. To many in Asia, the answer is a resounding “No”,
largely because of dissatisfaction with the Han unification process that merged Chinese,
Japanese, and Korean ideographs.
Unicode supports many base standards as they existed in 1993 (and Version 3.0 will align the
Unicode Standard with newer versions of some of those standards), but the field of language
is not static, and new characters are continually being created. According to [Huang and
Huang 89], the Chinese are inventing many new characters each year, and it remains to be
seen how well new characters are added to the repertoire.
In addition, Unicode does not yet support some modern scripts, although some unsupported
scripts are for languages that are commonly written with other, supported scripts. While
Unicode also doesn't support many archaic and obsolete scripts, several of these are proposed
for future inclusion.
Who Uses Unicode or ISO/IEC 10646?
Both the list of software that supports Unicode and the list of applications that require
Unicode support are continually growing. On the software side, this includes:
•
Java
•
MacOS
•
Windows NT
•
Windows 95 (partial)
•
AIX
•
Plan 9
•
NetWare 4.0
•
QuickDraw GX
•
Jade
•
nsgmls and some other SGML parsers
•
XML applications
•
Tcl/Tk
•
Perl
•
ECMAScript/Javascript/JScript
•
Visual Basic and VBScript
Applications that require Unicode support include:
•
XML
•
HTML 4.0
•
Java
XML/SGML Asia Pacific 99
13
Unicode: What is it and how do I use it?
• Tony Graham
How do I specify “@´"
The character set may be standardised, but a character's representation in programs and data
VWLOOYDULHVZLWKWKHDSSOLFDWLRQ7DEOHVKRZVVRPHRIWKHZD\VWKDWWKH³@´FKDUDFWHULV
represented.
Table 6
@
+RZ 'R , 6SHFLI\ ³ ´"
Standard or
Application
Representation
Comment
Unicode
U+0416
An individual Unicode value is expressed as U+nnnn,
where nnnn is the character's number expressed in
hexadecimal notation.
CYRILLIC CAPITAL LETTER
ZHE
All Unicode characters have unique names, which are
the same as those in the English language version of
ISO/IEC 10646. Names contain only the uppercase
letters A to Z, space, hyphen-minus, and occasionally
digits. As this table shows, this makes it easy to
generate computer-sensible identifiers for individual
characters.
0000 0416
UCS-4
0416
UCS-2
CYRILLIC CAPITAL LETTER
ZHE
ISO/IEC 10646 assigns a unique name to each
character.
Java
\u0416
Java defines a special Unicode escape sequence for
representing the hexadecimal value of a Unicode
character.
Pre-TC SGML
Ж
Prior to the WebSGML Adaptations TC, SGML
allowed only decimal numeric character references.
XML and
WebSGML
Adaptations
SGML
Ж
XML, and SGML after the WebSGML Adaptations
TC, allows both decimal and hexadecimal numeric
character references. The numeric references do not
contain a “u” or otherwise signal that “this is
Unicode” since they are references to character
numbers in the document character set and, in
principle, any single-byte or multi-byte coded
character set could be used in an SGML application.
ISO/IEC 10646
Ж
SGML
Declaration
1046
Decimal character numbers only may be used.
Ж
Decimal numeric character references may be used in
minimum literals
DSSSL
#\cyrillic-capital-letter-zhe;
Characters may be referenced by an identifier derived
from the character name.
#\U-0416
Jade also supports of the form “U-” plus the
hexadecimal character number.
How do I use Unicode with SGML?
After the fundamental step of selecting SGML software that supports Unicode, you can then
reference ISO/IEC 10646 as the BASESET in the CHARSET declaration in your SGML
14
XML/SGML Asia Pacific 99
Tony Graham
• Unicode:
What is it and how do I use it?
Declaration. The following CHARSET declaration is taken from the SGML Declaration for
XML:
CHARSET
BASESET
"ISO Registration Number 176//CHARSET ISO/IEC
10646-1:1993 UCS-2 with implementation level
3//ESC 2/5 2/15 4/5"
DESCSET
0
9 UNUSED
9
2 9
11
2 UNUSED
13
1 13
14
18 UNUSED
32
95 32
127
1 UNUSED
128
32 UNUSED
160 65376 160
You should also refer to ISO/IEC 10646 as your syntax reference character set, and,
additionally, allow non-ASCII characters in names.
How do I use Unicode with XML?
All XML processors are required to support both the UTF-8 and UTF-16 encodings of
Unicode characters. XML processors may also support other encodings, and the XML
Recommendation lists some of them, but an XML document or parsed entity that is not in
either UTF-8 or UTF-16 must begin with an XML Declaration or text declaration that
specifies the encoding, for example, <?xml encoding="EUC-JP"?>.
To distinguish between the two encodings that do not need an encoding declaration, parsed
entities in the UTF-16 encoding must begin with the Byte-Order Mark (BOM). The BOM is
used in Unicode systems as a signature for detecting the sequence of bytes in each 16-bit
character. When the first two bytes in the parsed entity are FE16 and FF16, then the bytes are in
the correct order for the processor, and when the first two bytes are FF16 and FE16, the byte
pairs need to be transposed before interpretation as Unicode characters. The BOM is
unnecessary for UTF-8, the 8-bit encoding of Unicode characters, so its presence is used in
XML systems to indicate the UTF-16 encoding.
XML parsed entities may use any legal graphic character specified in the Unicode Standard
and ISO/IEC 10646. Any Unicode character that is not allowed in the current encoding may
be included by a numeric reference to that character's number in the ISO/IEC 10646
repertoire. Once characters outside the Basic Multilingual Plane become part of Unicode and
ISO/IEC 10646, you will be able to represent them in both UTF-8 and UTF-16. In UTF-16,
you reference the characters using a two-character sequence called a “Surrogate Pair”. If you
can't directly encode the characters outside the BMP, you can make a numeric character
reference to their character number. You cannot reference characters by pairs of numeric
references to the corresponding Unicode Surrogate Pair code values.
As stated previously, one of Unicode's design principles is “Equivalent Sequence” between
precomposed characters and their canonical or compatibility decompositions. By default,
however, XML does not consider that precomposed characters and their compatibility
decompositions are equal in string comparisons. The XML recommendation does state that
“at user option, processors may normalise such characters to some canonical form”, but a
quick survey doesn't show any XML parsers implementing this user option.
XML/SGML Asia Pacific 99
15
Unicode: What is it and how do I use it?
• Tony Graham
For example, according to the Unicode standard, LATIN SMALL LETTER O +
COMBINING DIAERESIS (o + ¨) and LATIN SMALL LETTER O WITH DIAERISIS (ö)
are canonical equivalents, but if they were used as the name in an element's start-tag and endtag, respectively, this would cause an error because they are not equivalent according to the
XML recommendation (Figure 6).
Figure 6.
Canonical equivalents are not equivalent to XML
This is more likely to happen, however, if the two occurrences of the name or string were in
two different entities, such as a document and its DTD, created using different editors that
output different, but canonically-equivalent under Unicode, character sequences for the same
text.
It remains to be seen whether the forthcoming W3C work on string identity matching
obviates this difference from Unicode before it becomes a problem for XML. It may help you
with XML data sent over the Web, but it probably won't save you from causing yourself
problems with different software on your own computer.
Can I use ISO Entities with Unicode?
Annex D of ISO 8879 declared several sets of general entities, and these have been in wide
use over the last thirteen years. ISO TR 9573 superseded Annex D and declared more sets of
general entities.
Despite this longevity and comparatively widespread use, Unicode and ISO/IEC 10646 do
not include characters for all of the ISO entities. If you want to stay with the official character
repertoire, XML-compatible versions prepared by Rick Jelliffe of several ISO entity sets are
available at http://www.altheim.com/specs/charents.html.
The W3C Mathematical Markup Language (MathML) effort needed all of the ISO TR 9573
entities and more besides, so the MathML recommendation [MathML 99] defines characters
for the additional entities and, for now, standardises character numbers for these characters in
the user-defined area of Unicode and ISO/IEC 10646. The recommendation states that the
“STIX project of the STIPUB group of scientific and technical publishers” is working both
on making available a complete set of fonts for Unicode characters for science and
technology and on proposing inclusion of these characters in Unicode and ISO/IEC 10646.
The STIPUB Consortium, working through the American Mathematical Society, submitted a
proposal for additional math symbols and technical symbols to the ISO/IEC working group in
March 1998 [WGReg 98], but additional math symbols and technical symbols are not
currently listed among Unicode's proposed scripts [Prop 99].
Details of the characters, including lists organised by entity name and by Unicode character
number, are available at [MathML 99]. The entities are current, revised version of the
MathML DTD [MathML DTD]. Since the character numbers for these additional characters
are in the private use area, it is best to take the advice of the MathML recommendation and
16
XML/SGML Asia Pacific 99
Tony Graham
• Unicode:
What is it and how do I use it?
refer to these characters by entity name only, as the character numbers will change if the
characters are officially included in Unicode and ISO/IEC 10646.
What are the Unicode and ISO/IEC 10646 Coded
Representations?
The Unicode Standard and ISO/IEC 10646 share a bewildering variety of coded
representations for the same characters. The following sections will help in deciphering the
acronyms.
UCS-4 – Four-Octet (32-bit) Canonical Form of the UCS
UCS-4 is the four-byte representation of characters in ISO/IEC 10646. It is not used in the
Unicode Standard. It can represent more characters than the Unicode Standard intends to ever
represent.
UCS-2 – Two-Octet (16-bit) BMP Form of the UCS
UCS-2 is the two-byte representation of characters defined in ISO/IEC 10646, but the
Unicode Standard, Version 2.0, is compatible with UCS-2. The Unicode Standard, Version
3.0, will outgrow UCS-2 when it declares planes 15 and 16 as available for private use.
UTF-7 – UCS Transformation Format, 7-Bit Form
UTF-7 is defined in IETF RFC 1642, and is summarised in an Appendix to the Unicode
Standard 2.0, but it is not part of ISO/IEC 10646. UTF-7 is a 7-bit form that is safe for use
with email, Simple Mail Transport Protocol (SMTP), and Multimedia Internet Mail
Extensions (MIME). Each 16-bit Unicode character is encoded as between 1 and 2] 7-bit
characters.
UTF-7 is sometimes considered to be obsolete, but there are still 7-bit mail systems in use
around the world.
UTF-8 – UCS Transformation Format, 8-Bit Form
UTF-8 is common to both the Unicode Standard and ISO/IEC 10646, and all XML
processors are required to support this format. Originally developed by X/Open to enable use
of Unicode character data in 8-bit Unix environments, this transformation format was known
as File System Safe UTF (FSS-UTF) and as UTF-2 before being renamed UTF-8 and
included as a normative addendum to ISO/IEC 10646.
UTF-8 encodes each Unicode character as one, two, or three bytes (and UCS-4 characters as
up to six bytes). All of the ASCII code values are represented as a single byte, most nonideographic characters are represented as two-byte sequences, and the remaining Unicode
characters are represented as three-byte sequences. Once characters outside the BMP are
defined, the Surrogate Pair codes that Unicode uses to represent these characters will be
represented as four-byte sequences
UTF-8 is a simple and efficient conversion and a reasonably compact encoding, and it is
compatible with ASCII since the ASCII characters map to the same coded representation in
UTF-8. However, many Japanese, for example, do not like UTF-8 since their files become
bigger when the current double-byte characters are converted to three bytes of UTF-8.
XML/SGML Asia Pacific 99
17
Unicode: What is it and how do I use it?
• Tony Graham
UTF-16 – UCS Transformation Format for Planes of
Group 00
UTF-16 is an ISO/IEC 10646 encoding that is equivalent to the Unicode Standard with the
use of surrogates. The only difference between UCS-2 and UTF-16 is the use of surrogates
with UTF-16. However, while XML specifies support for UTF-16, and UTF-16 includes the
code points for surrogate pair characters, the character numbers for the code points in the
surrogate blocks are not legal XML character numbers. Surrogate pairs in a UTF-16
document will be refer to a non-UCS-2 character, but numeric character references to a
character number in the surrogate blocks are not legal XML.
Use of code points in the surrogate blocks, or the S (Special) Zone of the BMP as it is called
in ISO/IEC 10646, enables addressing of characters in planes 1…1610 of group 0 of UCS-4;
that is, it allows addressing of over one million additional UCS-4 code values in the range
00010000…0010FFFF16.
The Unicode Consortium now refers to UTF-16, UTF-16BE, and UTF-16LE. UTF-16 is
UTF-16 with the Byte-Order Mark (BOM) to indicate the order of the two bytes of a UTF-16
character. UTF-16BE is UTF-16 without the BOM and with the most significant byte first. It
is the “native” UTF-16 format for big-endian processors. UTF-16LE is UTF-16 without the
BOM and with the least significant byte first. It is the “native” format for little-endian
processors such as Pentiums.
UTF-32
This is a new 32-bit format published by the Unicode Consortium. UTF-32 characters are
identical to UCS-4 characters over the UTF-32 character range, but, by design, UTF-32 only
addresses the characters in planes 00 to 16 that you can represent using UTF-16. In addition,
UTF-32 characters are specified as having the full Unicode character semantics, which you
cannot guarantee for text labelled as UCS-4.
Which is best for me?
There is no single, simple answer to this question. The choice of encoding will depend in part
upon your language and in part upon the tools that you are using. For example, if you are
working in English, it is simplest to use UTF-8 since UTF-8 is a superset of ASCII, but if you
are working in Japanese, it would be preferable to use UTF-16, since using UTF-8 would
result in larger files. However, if you working with Perl, the latest version of which handles
Unicode characters as UTF-8 internally, it might be simplest to only use UTF-8 for input and
output.
Your choice may be determined by your choice of application, since even on “little-endian”
Intel processors, some applications save Unicode text as UTF-16BE. The good news is that
recent versions of web browsers can handle both UTF-8 and UTF-16 text, as can your XML
software.
Further Information
Definitive information on the Unicode Standard is available from the Unicode Consortium at
http://www.unicode.org.
Information on ISO/IEC 10646 is available at http://www.dkuug.dk/jtc1/sc2/wg2/.
18
XML/SGML Asia Pacific 99
Tony Graham
• Unicode:
What is it and how do I use it?
Bibliography
[Huang and Huang 89]
Huang, Jack K. T., and Timothy D. Huang. 1989. An
Introduction to Chinese, Japanese, and Korean Computing. Singapore: World
Scientific
[MathML 99] Mathematical Markup Language, Chapter 6, Entities, Characters, and Fonts,
http://www.w3.org/TR/REC-MathML/chapter6.html
[MathML DTD]
Mathematical Markup Language, Appendix A, The MathML DTD,
http://www.w3.org/TR/REC-MathML/appendixA.html
[Prop 99]
Unicode Consortium, Proposed New Scripts,
http://www.unicode.org/pending/pending.html
[TR7 98]
Unicode Consortium, Technical Report #7, Plane 14 Characters for Language
Tags, http://www.unicode.org/unicode/reports/tr7.html
[WGReg 98] ISO/IEC JTC 1/SC 2/WG 2, N1750, Partial Document Register (N1600 N1810) + standing documents,
http://wwwold.dkuug.dk/JTC1/SC2/WG2/docs/documents
Biography
Tony Graham
Senior Consultant
Mulberry Technologies, Inc.
17 West Jefferson Street, Suite 207
Rockville MD 20850 U.S.A.
Phone: 301/315-9631
E-mail: [email protected]
Web: http://www.mulberrytech.com
Tony Graham is a Specialist Member of the Unicode Consortium and author of the
forthcoming Unicode: A Primer. He has been keenly interested in character set issues since
1991 when he joined Uniscope, Inc. in Tokyo, Japan, where he developed SGML
applications in Chinese, English, Japanese, and Korean. He is currently a Senior Consultant
with Mulberry Technologies, Inc., an SGML and XML Consultancy specialising in training
and design. Tony has spoken on Unicode to conference and user group audiences, and has
written on Unicode for MultiLingual Computing & Technology.
XML/SGML Asia Pacific 99
19