
Bits, bytes, and Unicode: An introduction to digital text for linguists
James A. Crippen
3 April 2010
1. Introduction
Unicode has been around in some fashion since about 1991. Over the last decade linguists have become aware that something called “Unicode” exists, and that they should be using it because it is somehow better than other things. But in my experience many linguists have no concrete idea what the difference is between Unicode and other things, nor even what those other things are. There is generally a lack of knowledge of how Unicode works, what it is for, and what consequences are associated with it. Unfortunately this situation is not easily rectified by simply directing people to existing documentation on Unicode because such documents are generally aimed at computer scientists and software developers, assuming an existing foundation in digital text processing. In addition much of the Unicode literature is couched in specialized jargon peculiar to the Unicode tradition. In this article I hope to provide a comparatively gentle introduction to the foundations of digital text, and to describe Unicode in a way that makes it clear what its benefits and limitations actually are.
I have attempted throughout to ensure that the information presented is both up-to-date and as timeless as possible, but since technological change is rapid certain things in this article will eventually be obsolete. The version of Unicode at the time of writing is 5.2.0 and hence all references to Unicode are in the context of that version, so readers are encouraged to check with newer versions as they become available.
2. Bits and bytes
The foundation of all modern computing is the bit.1 This is simply an element which can have two exclusive values, usually represented as 0 and 1. Bits are the elements which make up the binary number system which is often known as base two in mathematical contexts. Counting in binary is done using units that are powers of two, in contrast with the decimal system that uses units which are powers of ten. The use of 0 and 1 for the two exclusive binary values is a convenience and a sort of mathematical tradition, but bits may also represent − or +, false or true, off or on, no or yes, and any other sort of binary opposition. Mathematical logic using the binary number system is sometimes known as Boolean logic in honor of George Boole’s invention of true-or-false logic which is mathematically equivalent to counting with only zero and one.
1. There have been a few experimental computers which used ternary (base three) rather than binary, and hence the fundamental unit in these systems was the “trit”. Ternary computing has yet to become popular.
hex. dec. bin.      hex. dec. bin.      hex. dec. bin.       hex. dec. bin.
0    0    0         8    8    1000      10   16   10000      18   24   11000
1    1    1         9    9    1001      11   17   10001      19   25   11001
2    2    10        A    10   1010      12   18   10010      1A   26   11010
3    3    11        B    11   1011      13   19   10011      1B   27   11011
4    4    100       C    12   1100      14   20   10100      1C   28   11100
5    5    101       D    13   1101      15   21   10101      1D   29   11101
6    6    110       E    14   1110      16   22   10110      1E   30   11110
7    7    111       F    15   1111      17   23   10111      1F   31   11111
Table 1: Hexadecimal, decimal, and binary numbers from 0 to 31.
Table 1 gives the binary numbers from 0 to 31 with their counterparts in the decimal number system, as well as the hexadecimal system which will be discussed later. The rightmost or lowest bit in each number alternates between 0 and 1, and even numbers always have the lowest bit as 0. Each position in a binary number is a power of two, thus the lowest bit b₀ has the value b₀ × 2⁰, the next lowest bit b₁ has the value b₁ × 2¹, the next bit b₂ has the value b₂ × 2², and so forth. The total value of a binary number is the sum of each bit’s positional value. Thus the binary number 101 has the value (1 × 2²) + (0 × 2¹) + (1 × 2⁰) = 4 + 0 + 1 = 5.
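As a concrete illustration, the following minimal sketch verifies the positional arithmetic for the binary number 101. Python is used for this and all later examples purely as a convenient illustration; nothing about the arithmetic depends on the language.

    # Sum the positional values of the bits in binary 101:
    # (1 × 2²) + (0 × 2¹) + (1 × 2⁰) = 5.
    bits = [1, 0, 1]  # most significant bit first
    value = sum(bit * 2**power
                for power, bit in enumerate(reversed(bits)))
    print(value)          # 5
    print(int("101", 2))  # Python's built-in base conversion agrees: 5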
Binary numbers are extremely difficult for most humans to read, particularly when the values are large. To solve this problem other number systems are used as shorthands for binary. The two used most frequently are octal and hexadecimal. Octal is base eight, i.e. each octal unit has eight possible values usually given from 0 through 7. Hexadecimal2 is base sixteen, so each hexadecimal unit has sixteen possible values from 0 through 15. These two systems are used because they evenly divide binary numbers into small chunks, with octal numbers representing sequences of three bits and hexadecimal numbers representing sequences of four bits. Although octal was prominently used in computing for a few decades, hexadecimal is today the most popular system for binary shorthand. Since Unicode is always described using hexadecimal, I will not consider octal any further in this paper.
A hexadecimal unit – sometimes known as a “hexit” by analogy with decimal “digit” – represents a sequence of four bits. Because the decimal system for representing numbers does not have unitary symbols beyond 9, the Latin alphabet was drafted to serve for the values from 10 to 15. One thus counts in hexadecimal from 0 to 9 and then from A to F. This representation obviously overlaps with the decimal system in its smaller values, so the convention is to include an indication that a given symbol is hexadecimal rather than decimal.

2. The term “sexadecimal” might be more proper Latin, but this invites the taboo abbreviation “sex” and has thus been avoided.
Unfortunately such indicators vary among different programming language traditions; for example the C programming language uses a prefix “0x” to indicate hexadecimal, the Pascal programming language uses “$” as a hexadecimal prefix, and Intel assembly language uses the suffix “h” or “H”. I will consistently use a subscript number 16 for hexadecimal, with a subscript 2 for binary, and a subscript 10 for decimal where necessary; this is in keeping with representation in mathematical texts.3 Thus the number 1111₂ is 15₁₀ and F₁₆. The hexadecimal number CAFE₁₆ is the decimal value (12 × 16³) + (10 × 16²) + (15 × 16¹) + (14 × 16⁰) = (12 × 4096) + (10 × 256) + (15 × 16) + (14 × 1) = 49152 + 2560 + 240 + 14 = 51966. This number is also the binary sequence 1100 1010 1111 1110, with the hexadecimal number being much easier to read as well as a quick segmentation of the binary sequence into four bit chunks: C₁₆ = 1100, A₁₆ = 1010, F₁₆ = 1111, and E₁₆ = 1110. Because of the widespread use of hexadecimal to represent binary, large binary numbers are often separated into four bit sequences with intervening spaces as above, a practice maintained throughout this article.
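The same check can be run for the hexadecimal example; a small sketch using Python’s built-in base conversions confirms the CAFE₁₆ arithmetic and the four-bit segmentation:

    # CAFE₁₆ as a decimal value, via int() with an explicit base:
    assert int("CAFE", 16) == 51966
    assert 0xCAFE == 0b1100101011111110  # same number, two notations
    # Each hexit corresponds to exactly one four-bit chunk:
    for hexit in "CAFE":
        print(hexit, format(int(hexit, 16), "04b"))
    # C 1100
    # A 1010
    # F 1111
    # E 1110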
Computers do not in general process bits alone, but instead operate on larger sequences called bytes. The actual number of bits in a byte has historically varied between six, seven, eight, and nine bits, but today the eight-bit byte is essentially universal.4 Such bytes are represented with two hexadecimal units, thus the byte 1101 0010 = D2₁₆. There are other sizes of bit sequences that computers operate with, such as “nybble”, “deckle”, “word”, “dword”, “quadword”, and so forth, but we do not need to be concerned with them here.
In working with digital data, it is crucial to comprehend that everything in a computer is binary and hence composed of bytes. This includes text, audio, video, pictures, computer programs, hardware signals, and the pixels on the screen. Computers do not actually “understand”5 that there are differences between the bytes that are used for pixels and those that are used for text. Instead humans – both programmers and users – create the contexts in which bytes exist by writing and using computer programs; such contexts give meaning to bytes through the behavior of computers running the computer programs. A useful linguistic analogy is the meaninglessness of speech sounds which are given meaning by their context in a linguistic system.6 The essential point is that bytes are meaningless and computers have no idea what to do with them unless explicitly instructed. Digital data becomes useless when there is no program which can contextualize the data in a useful manner. By maintaining the programs used to operate on data, we prolong the useful life of the data. In the context of digital text we must maintain the standards that prescribe how text is to be represented, as well as the software that implements such standards, otherwise our digital text will become uninterpretable in the future.

3. The appearance of one of the letters A through F in a sequence of numbers is often enough to show that a given number is hexadecimal, and a long sequence of only 0 and 1 is usually sufficient to tell that a given number is binary. In such cases, indicating the number system is unnecessary. In certain contexts such as Unicode code point numbers the use of hexadecimal is implicit and hence never indicated.
4. The Common Lisp programming language supports variably sized bytes of anything between one and thirty-six bits. A rarity today, this is due to its ancestry in computing systems using less common byte sizes.
5. A saying in computing is “don’t anthropomorphize computers because they don’t like it”.
6. There is an interesting philosophical question here, whether this mapping of arbitrary units to meanings is merely convenient engineering or whether humans have somehow intentionally recapitulated some structure of our linguistic systems. I leave this question to the reader for further pondering.
3. Characters, character sets, and character encodings
Digital text is essentially any linguistic or paralinguistic information represented in a binary code. The smallest unit of digital text is the character. The term “letter” might at first seem appropriate here, but it is actually too restrictive since digital text may contain numbers, punctuation, blanks (“whitespace”), and many other elements which are not themselves letters in any traditional sense. Every character is represented by a unique sequence of bits, the value of which is largely arbitrary. To give a sense of the arbitrariness of character values, consider the Latin lowercase letter 〈a〉. This is usually represented in binary as the character 0110 0001 = 61₁₆ = 97₁₀. This value has no particular meaning like “low front vowel” or “first letter of the lowercase Latin alphabet” outside of its context in a particular digital text encoding system.
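In Python this arbitrary mapping between characters and numbers is directly visible through the built-in ord() and chr() functions; a quick sketch:

    print(ord("a"))       # 97, i.e. 61₁₆
    print(hex(ord("a")))  # 0x61
    print(chr(0x61))      # 'a' – the mapping is pure convention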
Characters can occur in various sizes, ranging from three bits to thirty-two bits or beyond, depending on the particular system. The I Ching (Yì Jīng) trigrams ☰☱☲☳☴☵☶☷ are composed of three bits each, with, say, a solid line being a zero bit and a broken line being a one bit. In such a three-bit character system the maximum value is 2³ = 8 meaning only eight characters can be represented. Morse Code uses characters of variable size, with early systems reaching a maximum of five bits, although most characters are only two or three bits in length: –– ––– ·–· ··· · MORSE. The largest value in such a system is 2⁵ = 32 so that a maximum of 32 characters can be represented. Braille is a six-bit system where dots are either present or absent: ⠃⠗⠁⠊⠇⠇⠑ BRAILLE. The Braille character set thus has 2⁶ = 64 maximum characters.
Before the spread of Unicode, the most widely adopted standard for digital text was the American Standard Code for Information Interchange, better known as ASCII,7 and usually pronounced /ˈæski/ in English. This is a seven bit system with a maximum of 2⁷ = 128 characters; it is sufficient to support basic English text with a somewhat reduced set of punctuation – the en-dash 〈–〉 and section sign 〈§〉 are absent, for instance. A system which competed with ASCII for some time but which eventually fell by the wayside was IBM’s Extended Binary Coded Decimal Interchange Code, abbreviated EBCDIC and usually pronounced /ˈɛbsidɪk/. This was a larger eight-bit system with a maximum of 256 characters, but it died out due to the existence of multiple incompatible versions and some unfortunate design decisions. The ISO 8859 system is also eight bits, as is the Mac Roman system and the Windows-1252 system, among many others. Each of these larger systems was developed to extend the possible number of characters beyond the ASCII limitation so as to support various European languages requiring a larger variety of symbols.

7. ASCII is standardized in the United States by the American National Standards Institute as ANSI X3.4-1986 and its predecessors. Internationally it is also standardized as ISO 646 by the International Organization for Standardization. Reference to “ASCII” is more common than reference to the particular standards that define it.
Having brought up several modern digital text coding systems it is pertinent to discuss how such systems are described. Any given collection of characters is called a character set, also sometimes known as a code page in the context of MS-DOS.8 A character set does not necessarily specify how the characters in the set are represented in binary; it merely delineates what characters will be available and how they are to be presented. The specification for how characters are actually represented in binary is called a character encoding, usually just abbreviated to “encoding”. In practice this distinction between character set and character encoding is often ignored because the character set and the character encoding are usually defined in the same standard, and hence there is no real need to make the distinction. However, when dealing with multiple incompatible character encodings it is useful to refer to a particular character set as an entity separate from the encoding. For example the character appearing as 〈a〉 is represented as 61₁₆ in ISO 8859-1, Windows-1252, and Mac Roman, however the character appearing as 〈é〉 is E9₁₆ in ISO 8859-1 and Windows-1252 but 8E₁₆ in Mac Roman. Thus these character sets might share the same elements, but their character encodings are different. In the context of Unicode the distinction between character set and character encoding is essential since the Unicode standard specifies a few distinct encodings for the same character set.
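The difference between a shared character set and divergent encodings can be observed directly; a small Python sketch (the codec names here are Python’s, not part of the standards themselves):

    # 'é' occupies different code points in different encodings:
    print("é".encode("latin-1").hex())    # e9 (ISO 8859-1)
    print("é".encode("cp1252").hex())     # e9 (Windows-1252)
    print("é".encode("mac-roman").hex())  # 8e (Mac Roman)
    # 'a' is 61 in all three, a value inherited from ASCII:
    print("a".encode("latin-1").hex())    # 61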
The term repertoire refers to the range of different characters available in a particular character set. ASCII has a small repertoire which only supports English with limited punctuation. In contrast Unicode has an extremely large repertoire which supports most writing systems – scripts in Unicode parlance – including a number which are now extinct. The distinction between repertoire and character set is somewhat vague, but essentially a repertoire is abstract whereas a character set is concretely specified according to some standard.
A code point is a single element in a specific character set. To reuse the earlier example, the code point of the character 〈é〉 is E9₁₆ in the ISO 8859-1 character set. Characters are assigned to particular code points, however not all code points are used in most character sets. This may seem like an illogical waste of space but there are sound technical reasons why certain code points are reserved in most character sets. In ASCII the code points between 00₁₆ and 1F₁₆ are reserved for special control codes which are used to control the behavior of teletypewriters, a vital function when the ASCII standard was formulated. Subsequent character encodings have usually maintained this block of reserved code points for backward compatibility with ASCII. In Unicode there are a large number of code points that form the Private Use Block which is reserved for the implementation of non-standard characters without sacrificing compatibility between different systems. Many other Unicode blocks include small numbers of reserved code points so that there is space available for future expansion.
Text in one character encoding is usually unreadable when processed with another encoding, since outside of the desire for backwards compatibility with ASCII there has historically been little attempt to ensure compatibility between different encodings. The result of viewing text in one encoding using the display system of another encoding is called mojibake, a term derived from Japanese moji-bake “character-mutation”. If we take the Japanese text 〈文字化け〉 mojibake in UTF-8 – a Unicode encoding explained below – and display it using ISO 8859-1 then the result is the unintelligible mess 〈æ–‡å—化ã▢〉. Mojibake is known by several other names in languages where it is commonplace, such as 〈亂碼〉 or 〈乱码〉 luànmǎ “chaos code” in Mandarin Chinese, 〈외계어〉 oigyeeo “alien language” in Korean, 〈крякозябры〉 krjakozjabry “childish scribbles” in Russian, and Zeichensalat “character salad” in German. It is the bane of a computer user’s existence when dealing with multiple character encodings, and a major nuisance to people who use a script requiring characters outside of those found in Western European languages. Despite the widespread adoption of Unicode, mojibake still occurs quite frequently. A situation similar to mojibake is when the encoding is handled correctly but the font used for displaying the characters is lacking a particular character, a phenomenon which seems to have no common English name other than “▢▢▢▢▢” in my experience.

8. This operating system is often known as simply “DOS”. There are actually several other “Disk Operating Systems” abbreviated DOS in the history of computing, though Microsoft’s MS-DOS and its close relatives PC-DOS from IBM and DR-DOS from Digital Research are by far the most popular. I avoid “DOS” for clarity.
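Mojibake is easy to produce deliberately. The following Python sketch encodes the Japanese example as UTF-8 and then misinterprets the bytes as Windows-1252 (a superset of the printable part of ISO 8859-1); the exact garbage varies with the decoder, since some byte values have no printable interpretation in the target encoding:

    raw = "文字化け".encode("utf-8")
    print(raw.hex())  # e69687e5ad97e58c96e38191, the twelve underlying bytes
    # Misinterpret the UTF-8 bytes as Windows-1252; bytes with no
    # assigned character become replacement characters.
    print(raw.decode("cp1252", errors="replace"))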
Since data which is encoded in one encoding appears as mojibake in another encoding, there is a need for moving data between encodings. Such a process is called conversion. When data is converted from one encoding to another the meanings of characters are retained however their code points are generally different. Conversion can be divided into two types, round-trip conversion where data can be converted back and forth between two encodings without loss of information, or one-way conversion where one encoding has a greater character set repertoire than the other. The design of Unicode has emphasized round-trip conversion at the expense of size and complexity, such that there are many obscure characters in the Unicode repertoire that might otherwise be unused except for the round-trip conversion requirement between a Unicode encoding and some older “legacy” encoding.
Incorrect conversion usually results in mojibake as described previously. On the other hand, incomplete conversion has a number of different symptoms depending on the type of conversion and the encodings involved. Drop-outs are characters which appeared in the source data but are missing entirely in the converted data, e.g. 〈drop•out〉 → 〈dropout〉. Turd is a colloquial name for encoding elements which should be removed during conversion but which somehow survive the process and appear as superfluous characters in the converted result, for example 〈turd : 糞〉 → 〈turd : ␛$B糞〉 where an invisible ISO 2022-JP escape code in the source data is incorrectly retained in the result. Dread questionmark disease is a phenomenon describing documents created in Microsoft Office products which use a Microsoft-specific variant of ISO 8859-1 called Windows-1252. This phenomenon is particularly common with HTML produced by Microsoft Word that disingenuously declares the text to be in ISO 8859-1 when that is not in fact the case. A typical result is where “smart quotes” and some other non-ASCII characters appear as question marks in web browsers unaware of the encoding difference, e.g. 〈“Microsoft® Word™”〉 → 〈?Microsoft? Word??〉.9 Most of the time such incomplete conversion problems are simply termed garbage, and only rarely do people bother to make these finer distinctions outside of computer programming contexts.

9. The harshly named “demoroniser” program at http://www.fourmilab.ch/webtools/demoroniser/ is a conversion utility addressing this problem. Windows-1252 uses the unused code points from 80₁₆ to A0₁₆ in ISO 8859-1 for additional characters not available in ISO 8859-1, thus creating an incompatibility between the two character sets and encodings. The similarly numbered code points in the Unicode character set are expressly defined as the C1 Control Characters and thus cannot be exapted while remaining compliant with the Unicode standard, a policy which alleviates this problem.
4. The Unicode character set
A bewildering variety of character sets and encodings were in use during the 1980s and into the 1990s before the wide adoption of Unicode. Some notable examples include the universal but very limited ASCII, the larger ISO 8859-1 also called “Latin-1” used mostly by Unix-based operating systems for supporting Western European languages, the MS-DOS code page CP 437 also for Western European languages, Mac Roman used on Apple Macintosh computers, ANSEL used by many library systems, and KOI-8 R used on MS-DOS to support Cyrillic. Many East Asian scripts are formidably large and consequently various governments established national character sets and encodings. Among them are JIS X 0208 for Japanese, JIS X 0213 which is an extended system for Japanese, GB2312 for Simplified Chinese, HKSCS for Cantonese, Big5 for Traditional Chinese, KS X 1001 for Korean, and VISCII for Vietnamese. The ISCII system for encoding languages of India also deserves special mention. None of these character sets has much in the way of compatibility beyond the inclusion of ASCII, and the encoding systems are usually structured in quite different ways. Unicode was intended to replace this alphanumeric soup of character sets and encodings with one single character set that could support all of the world’s writing systems.
Standardization of Unicode began in 1991, although the earliest efforts date back to 1988 (Unicode 2009a:3). The initial motivation behind Unicode was to halt the proliferation of enormous incompatible character sets and encodings for East Asian languages. Since these languages all share a common source in the Chinese script, termed the Han script in Unicode jargon (Unicode 2009a:380), it was supposed that they could all be unified into something which would be able to represent all the major East Asian languages in a single system, an ideal called Unihan. The history behind this effort is not of much interest for those unconcerned with East Asian languages, so I will not discuss it further; the essential resources for this topic are Lunde (1999) and Appendix A of the Unicode standard (Unicode 2009a:565). A notable term arising from this background is the acronym CJK(V) which stands for “Chinese–Japanese–Korean(–Vietnamese)”, usually referring to the Chinese symbols which are found in the writing systems of the languages mentioned, but also to the typographic conventions common to all scripts based on the Chinese square block model. Vietnamese is sometimes excluded because it uses a modified Latin script called quốc ngữ today, however the historic script 〈字喃〉 chữ Nôm based on Chinese characters still sees use in religious and pedagogical contexts (Unicode 2009a:380).
The Unicode Consortium is the official organization which publishes the standard for Unicode and manages the changes which are made to it with successive versions. Most major computer hardware and software companies are members of the Unicode Consortium, such as Google, Apple, Microsoft, IBM, and Adobe. The government of India is a notable member, as is the University of California Berkeley. A few linguistic organizations are also members, including SIL International, Evertype, and Tavultesoft. Liaison members are organizations which develop standards that have some impact on the design or implementation of Unicode. The Linguistic Society of America is one of a number of liaison members with a significant linguistic interest in Unicode standardization. Membership in the Unicode Consortium is open to the public, but membership is not especially useful except for individuals who have an extensive and ongoing interest in how the Unicode standardization process progresses.
Since the Unicode Consortium is an organization independent of national or international standards bodies, the International Organization for Standardization (ISO) has adopted Unicode as the ISO 10646 standard. This lags behind the actual Unicode Consortium publications and does not include many aspects of Unicode which are essential for its proper implementation, so the ISO standard is largely ignored in practice.
Unicode began life as a simple 16 bit character encoding with 2¹⁶ = 65536 characters from 0000₁₆ to FFFF₁₆. The stated aim was to “encompass all the characters of the world’s languages”, and it was claimed that “in a properly engineered design, 16 bits per character are more than enough for this purpose”. Like all such limiting assumptions in computing, this was wrong. Fitting even the Chinese script into 65536 characters proved to be a challenge, much less all the world’s writing systems.
Today Unicode specifies a very large character set with several different but compatible encodings. Although it is sometimes described as simply “32 bit”, this is a misnomer since the complex internal structure of Unicode does not give 2³² characters (about 4 billion) but instead only about 1.1 million characters. As of Unicode 5.2.0 there are 246,943 code points allocated for some purpose with the rest reserved, and of that number 107,156 are graphical characters intended for writing the world’s languages, 137,468 are set aside for private use, and the remainder have specialized purposes such as formatting, control codes, and surrogate characters.
4.1. Structure and description of the Unicode character set
The 2³² space of Unicode characters is divided up into planes. Planes are very large collections of contiguous code points. The most commonly used plane is the Basic Multilingual Plane (BMP) which extends from 0000₁₆ to FFFF₁₆, and which constitutes the 65536 code points of the original Unicode standard. The following list shows the extant planes as implemented in Unicode 5.2.0.

• 00000 ... 0FFFF Plane 0, Basic Multilingual Plane (BMP)
• 10000 ... 1FFFF Plane 1, Supplementary Multilingual Plane (SMP)
• 20000 ... 2FFFF Plane 2, Supplementary Ideographic Plane (SIP)
• 30000 ... DFFFF Planes 3 to 13, unused
• E0000 ... EFFFF Plane 14, Supplementary Special-purpose Plane (SSP)
• F0000 ... FFFFF Plane 15, Supplementary Private Use Area-A
• 100000 ... 10FFFF Plane 16, Supplementary Private Use Area-B
• 110000 ... FFFFFF reserved

Each plane in Unicode can be divided into blocks which are small sequences of code points that are dedicated to a particular purpose or script. In practice blocks are only found in the BMP, SMP, SIP, and SSP, although theoretically they could be implemented within any plane. Blocks in the BMP are small and numerous, being mostly devoted to either scripts with relatively small inventories of symbols – e.g. Thai – or to logically related subgroups of symbols within a script with a larger inventory such as Cyrillic. The Latin script has the largest number of blocks allocated to it, however since these blocks are relatively small it does not actually form the largest script in the BMP. Instead that honor goes to the CJK Unified Ideographs block, which is also the largest block in the BMP. The largest block in any plane is the CJK Unified Ideographs Extension B block which is in the SIP, weighing in at 42,719 code points. A representative sample of blocks in the BMP is given below.
• 0000 ... 007F Basic Latin10
• 0080 ... 00FF Latin-1 Supplement
• 0100 ... 017F Latin Extended-A
• 0180 ... 024F Latin Extended-B
• 0250 ... 02AF IPA Extensions
• 4E00 ... 9FFF CJK Unified Ideographs (20,991 code points)
• AC00 ... D7AF Hangul Syllables (11,183 code points)
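Both the plane of a code point and its identity can be computed or queried programmatically; a brief Python sketch using the standard unicodedata module (whose character database tracks whichever Unicode version the Python build ships with, not necessarily 5.2.0):

    import unicodedata

    def plane(cp):
        # Each plane holds 10000₁₆ code points, so the plane number is
        # simply everything above the low sixteen bits of the code point.
        return cp >> 16

    for cp in (0x00E9, 0x4E00, 0x103C9, 0x20000):
        name = unicodedata.name(chr(cp), "<unnamed>")
        print(f"U+{cp:04X} plane {plane(cp)}: {name}")
    # U+00E9 plane 0: LATIN SMALL LETTER E WITH ACUTE
    # U+4E00 plane 0: CJK UNIFIED IDEOGRAPH-4E00
    # U+103C9 plane 1: OLD PERSIAN SIGN AURAMAZDAA-2
    # U+20000 plane 2: CJK UNIFIED IDEOGRAPH-20000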
As hinted at previously, there are several different types of characters in Unicode. The primary category is the graphic character which is what we usually expect to be a character. Graphic characters represent written language including nonverbal elements like punctuation, icons like weather symbols and smiley faces, mathematical symbols, and certain kinds of decorative elements such as dingbats and box-drawing elements. Essentially a graphic character can be anything which people will view.
Format characters represent invisible formatting commands which are meant to be interpreted by display software which presents text to the reader. There are a wide variety of different types of format characters, although the number is small in comparison to graphic characters. Examples of format characters include right-to-left or left-to-right print order indicators, the zero-width non-joiner which indicates that ligatures between graphic characters should be suppressed, and an invisible hyphen which indicates where a hyphenation break can be made in a word. Although Unicode specifies a number of useful formatting characters, these are usually ignored in favor of more general systems of organizing and displaying text such as HTML, XML, typesetting systems, desktop publishing, word processors, and so forth.
10. This block is often known as “Latin-1” after the ISO 8859-1 standard, which itself includes ASCII. The name of the “Latin-1 Supplement” block is a nod to this jargon term, as the block contains the half of ISO 8859-1 not covered by ASCII. The “-1” is because other ISO 8859 character sets provide repertoires for other Latin-based scripts such as for Central European languages in ISO 8859-2 or “Latin-2”.
Control code characters are a small set inherited from ASCII as well as a few from other character set standards. They are mostly limited to basic functions for controlling teletypewriters which are still visible on today’s keyboards, such as “carriage return”, “tab”, “line feed”, “backspace”, and so forth. Although they still have some uses in certain software, most control code characters are not extensively used today. Despite this they are not intended for reuse for other purposes, since such uses would break compatibility with other software and would violate the character semantics specified in the Unicode standard. Linguists should resist the temptation to use them for nonstandard purposes, instead using the private use characters described below.
Previous character standards did not in general have space reserved for nonstandard uses. In these situations people would use reserved code points for their own nonstandard characters, later causing trouble if the standard changed. To alleviate such problems Unicode has dedicated a block and two whole planes to private use. Private use characters have reserved, valid code points and are guaranteed to be processed reliably by programs which claim to be Unicode-compliant. They should be preserved in any conversions and not changed to anything that is not equivalent. Thus they can be used for characters which are not specified by Unicode in the same context as ordinary standardized characters. Many linguistic organizations have made extensive use of private use characters for representing scripts that are not specified by Unicode. In general linguists should consider applying to have new characters standardized rather than relying entirely on private use characters, but many people have successfully used private use characters as an interim representation until the somewhat slow standardization process is completed. If private use characters are specified carefully and used consistently then later conversion to newly standardized characters is often a relatively painless task, particularly using flexible conversion utilities such as SIL International’s TECkit.11

11. Available at http://scripts.sil.org/TECkit.
Surrogate characters are intended for use in the 16 bit UTF-16 encoding of Unicode. They have no function alone, but must occur in pairs which are known as surrogate pairs. The structure of surrogate pairs and their use will be detailed later in the discussion of UTF-16. For now the important point is that they are a special category of characters with code points in the BMP that are not used outside of the UTF-16 encoding.
The last two types of characters in Unicode are the reserved characters and noncharacters. Reserved characters are those which are not assigned yet, but which may be in some future version of Unicode. Use of reserved characters is prohibited because their functions and semantics in future versions of Unicode are unpredictable. Software complying with the Unicode standard is required to preserve reserved characters intact.
Noncharacters are, as the name implies, not characters. These unusual code points are guaranteed to never be used in the future, unlike reserved characters. The code point FFFF₁₆ is a noncharacter which, as the hexadecimal value implies, is composed of all binary ones. It is the highest code point in the Basic Multilingual Plane and was hence set aside as a sort of “fencepost” indicating the end of the BMP. The code point FFFE₁₆ is also a noncharacter, used to detect the order in which bytes are stored in a computer, working together with the character FEFF₁₆ which is often called the “byte order mark”. Since some systems store bytes from largest to smallest (big-endian, e.g. PowerPC-based systems) and others store the same bytes from smallest to largest (little-endian, e.g. Intel-based systems) this can be a processing problem for portable programs. If a program expects to find the byte order mark FEFF₁₆ in a certain location in a file but instead finds FFFE₁₆ then the program knows that the computer has an opposite byte order from the one on which the file was created. By making the FFFE₁₆ code point a noncharacter Unicode supports this technique. Most other noncharacters in Unicode have similar technical purposes.
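A sketch of the detection logic in Python (real decoders such as Python’s “utf-16” codec do this automatically; the function here only illustrates the technique):

    def detect_byte_order(first_two_bytes):
        # A file beginning with the byte order mark U+FEFF reveals its
        # byte order; seeing the noncharacter FFFE₁₆ instead means the
        # bytes are swapped relative to the reading machine.
        if first_two_bytes == b"\xfe\xff":
            return "big-endian"
        if first_two_bytes == b"\xff\xfe":
            return "little-endian"
        return "unknown (no byte order mark)"

    print(detect_byte_order("\ufeffabc".encode("utf-16-be")[:2]))  # big-endian
    print(detect_byte_order("\ufeffabc".encode("utf-16-le")[:2]))  # little-endian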
The description of Unicode characters is itself standardized. Code points are represented with U+xx... where xx... is a hexadecimal number. The “U+” prefix is sometimes omitted to save space, but it is more commonly used than not and helps to distinguish Unicode code points from arbitrary hexadecimal numbers and from descriptions of code points in other character sets. Most characters have names as well as code points, and these are usually written in all uppercase or in small caps.12 An exception to this naming practice is the CJKV characters which do not have names. ⟦FIXME: Why?⟧ Character names have a standard structure that is mostly consistent except for a few aberrant instances that date to the early days of Unicode standardization. Latin letters are named “Latin” or “Latin letter” initially, the term “capital” is used for uppercase, and the term “small” is used for lowercase. Diacritics are described after the base letter description. The following examples show how characters are referred to in most contexts.
• é U+00E9 Latin Small Letter E with Acute
• ʔ U+0294 Latin Letter Glottal Stop
• ㉀ U+3240 Parenthesized Ideograph Festival
• 𐏉 U+103C9 Old Persian Sign Auramazdaa-2
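Character names and code points can be queried in both directions with Python’s unicodedata module, for instance:

    import unicodedata

    print(unicodedata.name("é"))  # LATIN SMALL LETTER E WITH ACUTE
    print(unicodedata.name("ʔ"))  # LATIN LETTER GLOTTAL STOP
    print(unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE"))  # é
    print(f"U+{ord('é'):04X}")    # U+00E9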
4.2. Character categories and properties
⟦FIXME: To do.⟧
5. Unicode character encodings
The Unicode character set is specified as a series of code points from 0 to 2³², with a complex structure of planes and blocks. Code points are four bytes in length, though most have the upper two bytes as just zero which are omitted in citation, e.g. U+00E9 is actually U+000000E9. The basic character encoding of Unicode is just the code points themselves, thus 32 bits (4 bytes) per character, and this encoding is called Unicode Transformation Format 32, abbreviated as UTF-32. This encoding is not widely used; most characters only need 8 to 10 bits to differentiate them from other characters so using 32 bits for each and every character requires a large number of bytes full of zeros. It is however used internally in several Unix-based operating systems as well as some specialized software because the encoding algorithm is trivially simple.
12. Curiously, the Unicode standard is actually not entirely consistent on how character names are written. All uppercase is used in many contexts but all lowercase in references to other characters in the code charts. Titlecased small caps is used in this article for typographic aesthetics.
The UTF-16 encoding uses only two bytes per character for most text. In UTF-16 any character with a code point below U+FFFF is simply represented using the code point; this covers all the characters in the BMP. Characters with code points above U+FFFF require a more complex method for encoding. The solution is what is called a surrogate pair, two special characters which are used to represent higher code points. Each pair is made up of two surrogate characters, the leading surrogate being in the range U+D800 ... U+DBFF and the trailing surrogate being in the range U+DC00 ... U+DFFF. For example the cuneiform symbol 〈𐎠〉 U+103A0 Old Persian Sign A is a character in the SMP with a value well above U+FFFF. The UTF-16 encoding of this character is the sequence D800₁₆ DFA0₁₆ whereas it would be simply 000103A0₁₆ in UTF-32. In contrast a character below U+FFFF is encoded as its code point, so 〈é〉 U+00E9 Latin Small Letter E with Acute is encoded as 00E9₁₆ in UTF-16. For much Latin text the UTF-16 encoding is also wasteful although certainly less so than UTF-32, and for this reason other encodings are more commonly used to conserve space. Nevertheless, UTF-16 is a popular encoding for processing Unicode text, and is the internal encoding used by Microsoft Windows, Mac OS X, and the Java virtual machine.
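The surrogate pair arithmetic follows directly from the standard: subtract 10000₁₆, split the result into two ten-bit halves, and add each half to the base of its surrogate range. A Python sketch:

    def surrogate_pair(cp):
        # Only code points above U+FFFF need surrogate pairs.
        assert cp > 0xFFFF
        v = cp - 0x10000
        lead = 0xD800 + (v >> 10)     # high ten bits
        trail = 0xDC00 + (v & 0x3FF)  # low ten bits
        return lead, trail

    lead, trail = surrogate_pair(0x103A0)  # Old Persian Sign A
    print(f"{lead:04X} {trail:04X}")       # D800 DFA0
    # Python's UTF-16 codec yields the same bytes:
    print(chr(0x103A0).encode("utf-16-be").hex())  # d800dfa0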
The UTF-8 encoding is the most space-efficient of the common Unicode encodings. For characters in the Basic Latin block the UTF-8 encoding uses only one byte per character. Many other characters are encoded using two bytes, and the system extends to three and four bytes per character for the higher code points of more uncommon characters. The number of bytes per character, the character length, thus varies from one code point to the next. The following list shows the breakdown of character lengths depending on code points.

• U+0000 ... U+007F 1 byte per character
• U+0080 ... U+07FF 2 bytes per character
• U+0800 ... U+FFFF 3 bytes per character
• ≥ U+10000 4 bytes per character

This may seem like a relatively complicated system, and indeed it is. Nonetheless, it is regular and systematic such that implementations of it are consistent from one program to the next. There are a number of advantages to the UTF-8 encoding as well. In particular, all ASCII text is exactly the same as UTF-8 text, meaning that there is perfect backward compatibility with the vast majority of digital text already extant. A subtle but vital feature of UTF-8 is that no characters above U+007F are ever represented using sequences equivalent to ASCII characters, meaning that round-trip encoding always preserves at least the ASCII characters even if other characters are mangled. Also, the most frequently used code points require the least number of bytes, so that the system is much more space-efficient than other encodings. A few examples given below help to illustrate the results of UTF-8 encoding.
• a U+0061 within U+0000 … U+007F so 1 byte → 61
• α U+03B1 within U+0080 … U+07FF so 2 bytes → CE B1
• א U+05D0 within U+0080 … U+07FF so 2 bytes → D7 90
• अ U+0905 within U+0800 … U+FFFF so 3 bytes → E0 A4 85
• あ U+3042 within U+0800 … U+FFFF so 3 bytes → E3 81 82
• 𒀪 U+1202A beyond U+FFFF so 4 bytes → F0 92 80 AA
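These examples can be reproduced mechanically; in Python the encode() method reports both the byte length and the actual bytes:

    for ch in ("a", "α", "א", "अ", "あ", "\U0001202A"):
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} {len(encoded)} bytes: {encoded.hex()}")
    # U+0061 1 bytes: 61
    # U+03B1 2 bytes: ceb1
    # U+05D0 2 bytes: d790
    # U+0905 3 bytes: e0a485
    # U+3042 3 bytes: e38182
    # U+1202A 4 bytes: f09280aa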
The UTF-8 encoding algorithm is somewhat difficult for many people to understand and process mentally, but it is very well supported by most software systems today. There is thus little need for people to understand its workings since programs handle the encoding transparently, although the occasional bug may surface in which case it is useful to know how UTF-8 works as background for diagnosing the problem. It has become the preferred encoding for Unicode text, and is widely used in HTML, XML, most word processors, most e-mail programs, most database systems, and many other types of software. All modern operating systems have complete support for UTF-8 text, and hence transferring text between systems is usually painless, at least in the narrow context of text encoding.
6. Unicode beyond characters
The public perception of Unicode, even among many computer programmers, is that it essentially consists of a character set and a handful of character encodings. In fact the Unicode standard actually includes discussion of many things besides just character sets and encoding issues. Since Unicode attempts to support every known writing system it must describe each in turn, and for some writing systems the Unicode standard is the only easily accessible documentation available in English. In addition there are many specific issues involving how characters must be processed and how digital text must be represented which are essential components of the Unicode standard that extend well beyond the specification of particular characters and encoding systems.
6.1. Combining characters and normalization forms
An essential concept in Unicode is the idea of combining characters which are graphic characters intended to combine with other base characters to form a coherent textual element. The combining characters are mostly diacritics, but also include more unusual things like circles which are meant to surround other characters. An example is the combining character 〈´〉 U+0301 Combining Acute Accent. When combined with a base character it appears above the base as an acute accent diacritic. Thus the base character 〈e〉 U+0065 Latin Small Letter E can be combined with U+0301 to form 〈é〉 which is just the sequence of characters U+0065 U+0301. Note that in Unicode the combining characters always follow their base characters, never the reverse. There is essentially no limit on the number of combining characters although fonts and display systems usually only handle a few before proper positioning of the combining characters starts to fail and the result becomes unpleasant or illegible. Despite such display failures, as long as characters are coded properly the data is still functional.
Recall the character 〈é〉 U+00E9 Latin Small Letter E with Acute which was discussed in an earlier example. This is a single character which has the exact same appearance as the sequence of U+0065 U+0301 given above, but the two are not identical. Characters such as U+00E9 are called precomposed characters because they represent an underlying sequence of multiple characters with which they are interconvertible. Not all combinations of combining characters and base characters have corresponding precomposed characters, a limitation which can be frustrating but which prevents the Unicode character set from growing beyond its already huge size. An unpleasant side effect of not having all combinations of combining and base characters standardized as precomposed characters is that font designers often ignore the results of arbitrary combinations of combining characters and base characters, instead focusing only on the precomposed ones. This means that many combining characters display poorly in popular fonts. This is a display issue so the actual textual data is not affected, but it is nevertheless a nuisance. Orthography designers will often find it far easier to gain both community support and computer software support if only precomposed characters are used because a far greater number of fonts will be compatible with the new orthography.
Unicode text may include a random mix of precomposed characters and sequences of base+combining characters. Such a mix is termed denormalized because it is internally inconsistent. Unicode specifies two normalization forms which ensure that all text is represented consistently. The Normalization Form Canonical Decomposition (NFD) is one normalization form in which all precomposed characters are represented in decomposed form. In this form 〈é〉 will always be represented as U+0065 U+0301 using the combining acute accent, regardless of what the original characters were before the normalization. This normalization form is notably used by Mac OS X. The Normalization Form Canonical Composition (NFC) is in contrast notably used by Windows and Linux. In this normalization form all decomposed characters are represented in precomposed form where a precomposed form exists. Thus 〈é〉 will always be represented as U+00E9, but since 〈v́〉 does not have a precomposed character it will always be represented as a sequence of U+0076 Latin Small Letter V and U+0301 Combining Acute Accent.
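The two normalization forms can be applied and compared with Python’s unicodedata.normalize(); a minimal sketch:

    import unicodedata

    precomposed = "\u00E9"  # é as a single precomposed character
    decomposed = "e\u0301"  # e followed by combining acute accent
    print(precomposed == decomposed)                                # False
    print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
    # v́ has no precomposed form, so even NFC leaves two characters:
    print(len(unicodedata.normalize("NFC", "v\u0301")))             # 2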
The distinction between the two normalization forms is important for users whose text contains many diacritics, as is the case for many linguists. Usually the difference between the two is irrelevant since computer programs typically convert text from one normalization form to the other automatically. It is still useful to know whether one’s data is normalized or not, and if so which normalization form it uses, particularly in archival contexts. Assumptions of normalization based purely on a user’s operating system may be incorrect since it is not the case that every program on a given operating system uses the same normalization form internally.
6.2. Case: upper, lower, capital, small
In the context of text processing the term case refers to the systematic distinction of two different letterforms which have different textual functions but are interchangeable in some circumstances. This term derives from when printing was done almost entirely with small metal type blocks, a period stretching from the invention of moveable type until the late 19th century. Metal type blocks, usually just “type”, were conventionally stored in large wooden boxes or drawers called “cases”. A full set of type in a single typeface or “font” usually came in two of these cases, where one contained the capital letters and punctuation and the other contained the small letters. During use in printing shops the case of capital letters was placed in a rack above the case containing the small letters, hence the terms uppercase and lowercase. This may sound like a fanciful explanation but it is in fact absolutely true! Equivalent terms for upper- and lowercase are majuscule/minuscule and capital/small, the latter of which is standard in the naming of Unicode characters.
There are a number of alphabets around the world which include a case distinction. The Latin alphabet is the most obvious example, and it is the one in which the distinction was first developed. Greek and Cyrillic have cases which are largely equivalent to those in the Latin alphabet although usage may vary somewhat from Latin practices, and similar case distinctions occur in Armenian, Coptic, Glagolitic, and Deseret. The distinction between hiragana and katakana in the Japanese writing system is sometimes analogous to case, although it often also parallels the use of italic versus roman font variants. The Latin, Greek, and Cyrillic alphabets have had their systems of case extended to letters which did not originally have case distinctions, such as 〈Ə〉 U+018F Latin Capital Letter Schwa versus 〈ə〉 U+0259 Latin Small Letter Schwa, and 〈Ɂ〉 U+0241 Latin Capital Letter Glottal Stop versus 〈ɂ〉 U+0242 Latin Small Letter Glottal Stop.13 Unicode specifies case pairs between two characters so that programmatic changing of case works reliably. Thus text containing 〈ḡ〉 U+1E21 Latin Small Letter G with Macron can be automatically uppercased to 〈Ḡ〉 U+1E20 Latin Capital Letter G with Macron by a word processor when pasted into an all-uppercase section heading, for example.

13. Note that uppercase and lowercase glottal stop are distinct from the phonetic transcription glottal stop which is 〈ʔ〉 U+0294 Latin Letter Glottal Stop instead. The cased forms are for use in orthographies, whereas the caseless form is for use in transcription. There are several other such pairs and triplets in the Unicode repertoire of phonetic symbols. Linguists should be aware of them, especially when designing orthographies.
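Because such case pairs are recorded in the Unicode character database, ordinary string functions can rely on them; a brief Python sketch:

    print("ḡ".upper())  # Ḡ – U+1E21 maps to its case pair U+1E20
    print("ə".upper())  # Ə – U+0259 maps to U+018F
    # The caseless phonetic glottal stop U+0294 has no case pair,
    # so case conversion leaves it untouched:
    print("ʔ".upper())  # ʔ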
Linguists often make implicit and unconsidered assumptions about case when designing orthographies using alphabets that include case distinctions. In the context of Unicode it is important for orthography designers to consider case pairing explicitly and to test case pairing assumptions in a few common computer programs. It is essential that orthography designers do not assign mismatched case pairs which violate existing Unicode standardization, so for example 〈Ḧ〉 should not be mismatched with 〈h̤〉 but instead with 〈ḧ〉 even though the latter might be less aesthetically pleasing. Although it is possible to have unusual case pairings standardized by Unicode, the process is arduous and best avoided. It is also a good idea for orthography designers to avoid characters that have missing case pairs such as 〈ẘ〉 U+1E98 Latin Small Letter W with Ring Above because users will quickly find the disparity between cases to be a nuisance in their attempts to use the orthography on computers.
6.3. Sorting and collation
Collation is the official Unicode term for sorting, “the process of ordering units of textual information” (Unicode 2010). Although it may be assumed that all Latin alphabets are sorted in the same way, this is a fallacy. Sorting is actually specific to languages rather than to writing systems. For example the Swedish alphabetical order ends with 〈... X Y Z Å Ä Ö〉 but the Danish and Norwegian alphabets end instead with 〈... X Y Z Æ Ø Å〉 where the letters 〈Æ〉 and 〈Ø〉 are orthographic equivalents of the letters 〈Ä〉 and 〈Ö〉. Even for a single language the conventional sorting order is contextual; in most German dictionaries the differences between 〈A〉/〈Ä〉, 〈O〉/〈Ö〉, 〈U〉/〈Ü〉 and 〈ß〉/〈SS〉 are ignored in sorting, but in telephone directories 〈Ä〉 is treated as if it was 〈AE〉, 〈Ö〉 as 〈OE〉, and so forth. English sorting usually ignores differences in case, but many computer programs distinguish uppercase and lowercase in sorting with either uppercase preceding lowercase or vice versa.
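The contrast between naive code-point order and language-specific collation is easy to demonstrate. A Python sketch using the standard locale module (this assumes a German locale is installed, which varies from system to system):

    import locale

    words = ["Zebra", "Äpfel", "Apfel", "Öl"]
    # Naive sorting compares raw code points, so Ä and Ö fall after Z:
    print(sorted(words))  # ['Apfel', 'Zebra', 'Äpfel', 'Öl']
    # Locale-aware sorting interleaves the umlauted letters; setlocale
    # raises an error if de_DE.UTF-8 is not available on this system.
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))
    # ['Apfel', 'Äpfel', 'Öl', 'Zebra']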
Unicode specifies basic sorting practices with the Unicode Collation Algorithm (Unicode 2009b). This standard provides a default sorting order or collation sequence for all Unicode text, an especially important issue when textual data contains multiple languages. Other collation sequences are possible but they usually require extensive programming experience to implement, as well as the need for other programs to support additional collation sequences. Consequently orthography designers should avoid inventing new collation sequences. Ideally the collation sequence specified by the Unicode Collation Algorithm should be used instead. A less ideal but still workable solution is to use a collation sequence employed by a local majority language with a similar script, or one from some other language which has a similar orthography and is already well supported. Thus one might implement an orthography with 〈Ä〉 and 〈Ö〉 in an English-dominant region, but rather than using the default Unicode collation sequence or the local English collation sequence, the orthography developer might specify that the German telephone directory collation sequence is to be used instead. Actually getting this to work in disparate computer programs would probably still be a challenge.
7. Practical issues with Unicode
Using Unicode is easy in most respects. Today nearly all popular computer programs support Unicode, usually as a result of support in the underlying operating system. Most linguistic software supports Unicode without any particular modifications or customization needed other than selecting fonts which cover the appropriate Unicode characters. Nevertheless, there are some issues which arise from using Unicode in practice which need careful consideration.
7.1. The right character for the job
The most important thing to keep in mind when using Unicode for representing linguistic data, or when adapting or designing an orthography, is to use characters for their intended purposes. There are many characters in the Unicode character set which have similar appearances but the standard specifies different semantics for each. The most obvious distinction is between similar symbols in different scripts, for example Latin 〈A〉 U+0041 Latin Capital Letter A versus Cyrillic 〈А〉 U+0410 Cyrillic Capital Letter A and Greek 〈Α〉 U+0391 Greek Capital Letter Alpha. More subtle distinctions arise in phonetic transcription, and the Unicode standard often includes guidelines on how certain characters are meant to be used. The description of the Phonetic Extensions block from U+1D00 to U+1D7F includes the following statement (Unicode 2009a:171):
The small capitals, superscript, and subscript forms are for phonetic representations where style variations are semantically important. For general text, use regular Latin, Greek, or Cyrillic letters with markup instead.
What this means is that characters like 〈ᴀ〉 U+1D00 Latin Letter Small Capital A should not be used for ordinary small caps, but instead should only be used as a phonetic symbol, for example in some early 20th century Americanist transcriptions of the IPA vowel [ʌ]. Small caps as used in gloss abbreviations should instead be implemented using a word processor or font-specific mechanism rather than the small capital Unicode characters.
A related issue is the mixing of writing systems at the word level, something which in general should be avoided. There are certain traditions of using multiple scripts in single words, such as the Slavicist tradition of using Cyrillic 〈ь〉 U+044C Cyrillic Small Letter Soft Sign and 〈ъ〉 U+044A Cyrillic Small Letter Hard Sign as in transliterated Old Church Slavonic 〈krьstъ〉 /krĭstŭ/ “cross”. Americanist transcriptions from the early 20th century also used Greek letters mixed with Latin letters in phonetic transcription, for example Tlingit 〈ʼὰwὺłὶsίn〉 for something like IPA [ʔʌ̀wʊ̀ɬɪ̀sɪ́n] “he hid it” (Boas 1917:33). Although such symbols are no longer in use, it may be desirable to use these in reprints and digitizations of older works. In the absence of scholarly tradition or the need for representing obsolete transcriptions, linguists should generally avoid the mixing of scripts at the word level. Unicode directly supports this by supplying Latin equivalents of characters that have been borrowed from other writing systems, for example 〈ɛ〉 U+025B Latin Small Letter Open E in place of 〈ε〉 U+03B5 Greek Small Letter Epsilon, 〈ɩ〉 U+0269 Latin Small Letter Iota replacing 〈ι〉 U+03B9 Greek Small Letter Iota, and 〈ɣ〉 U+0263 Latin Small Letter Gamma for 〈γ〉 U+03B3 Greek Small Letter Gamma.
Linguists have historically been quite opportunistic in using unusual letters as symbols, and probably the most unfortunate is the use of 〈Ø〉 U+00D8 Latin Capital Letter O with Stroke and 〈ø〉 U+00F8 Latin Small Letter O with Stroke as a symbol for “zero” or “null”, along with occasional occurrences of 〈ϕ〉 U+03D5 Greek Phi Symbol or even 〈φ〉 U+03C6 Greek Small Letter Phi. None of these are intended for use as null symbols, and all but U+03D5 are letters rather than actual symbols. The preferred character in Unicode is instead 〈∅〉 U+2205 Empty Set, which has the following comments in the standard (Unicode 2009a:215):
2205 ∅ EMPTY SET
= null set
• used in linguistics to indicate a null morpheme or phonological “zero”
→ 00D8 Ø latin capital letter o with stroke
→ 2300 ⌀ diameter sign
Each of these comments has a specific implication for the semantics of the character. The first is an equivalence comment, meaning that this character is equivalently known as “null set” although that is not the standardized name. The following bullet point provides usage information: in addition to its use as a mathematical symbol for the null set or empty set, this comment recommends its use as the zero or null symbol in linguistics. The final two comments are contrastive references to other characters which should be distinguished from U+2205. Referring to 〈ø〉 U+00F8 LATIN SMALL LETTER O WITH STROKE, the comments include “• Danish, Norwegian, Faroese, IPA” (Unicode 2009a:10), which clarifies that this character is meant for use in IPA and hence other phonetic transcriptions, but does not suggest the use of this character for the empty set. Indeed, 〈Ø〉 U+00D8 LATIN CAPITAL LETTER O WITH STROKE includes the comment “→ 2205 ∅ empty set” (Unicode 2009a:10), which is meant to inform the reader that U+00D8 should not be used for this purpose and that U+2205 should be used instead.
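The character properties underlying these recommendations can also be inspected programmatically. In the following Python sketch, again my own illustration, the general category shows that the o-with-stroke characters are letters while U+2205 is a mathematical symbol:

import unicodedata

# General categories: Lu = uppercase letter, Ll = lowercase letter,
# Sm = mathematical symbol.
for ch in ("\u00D8", "\u00F8", "\u2205"):
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
# U+00D8 Lu LATIN CAPITAL LETTER O WITH STROKE
# U+00F8 Ll LATIN SMALL LETTER O WITH STROKE
# U+2205 Sm EMPTY SET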
The usual complaint about U+2205 as a zero or null symbol in linguistics is that it does not “look right”. In mathematical text the customary form of the empty set is a geometric circle with a line bisecting it at 45 degrees. In contrast, the customary linguistic form is a number zero with a line bisecting it at the stress angle of the typographic form. Fortunately, as font developers become more sensitive to this typographic diversity, more fonts are available with the form that linguists prefer. OpenType fonts often offer an alternate form which can be enabled using standard typographic options in software which fully supports OpenType. Thus SIL International’s Doulos SIL font has a default form of 〈∅〉 for U+2205 which is the usual mathematical form, but the option titled “Empty Set as Slashed Zero” produces a slashed-zero form instead, much more to the linguist’s liking.14 The default form of U+2205 in Cambria15 is likewise much closer to what linguists usually expect to see. Thus it is entirely possible to use the correct Unicode character while at the same time using the customary linguistic form.
Another common example of character misuse is the use of the punctuation characters 〈‘〉 U+2018 LEFT SINGLE QUOTATION MARK and 〈’〉 U+2019 RIGHT SINGLE QUOTATION MARK in orthographies, usually representing glottal stops, ejective consonants, or palatalization. A related misused character is 〈'〉 U+0027 APOSTROPHE. Because Unicode specifies different character categories for punctuation versus letters, many programs treat punctuation differently from other types of text. For example, most operating systems will exclude punctuation when selecting text with a double-click action, “words” are usually delimited by punctuation (including whitespace) in word processors, and sorting/collation algorithms usually ignore punctuation.
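This difference in treatment is easy to demonstrate. In the Python sketch below, an illustration of the general point rather than of any particular word processor, a regular-expression word match keeps a word containing the modifier letter U+02BB intact but splits the same word written with the punctuation character U+2019:

import re

word_with_letter = "Hawai\u02BBi"  # U+02BB MODIFIER LETTER TURNED COMMA
word_with_punct = "Hawai\u2019i"   # U+2019 RIGHT SINGLE QUOTATION MARK

# \w matches letters, so the modifier letter stays inside the word,
# while the quotation mark is treated as a word boundary.
print(re.findall(r"\w+", word_with_letter))  # ['Hawaiʻi']
print(re.findall(r"\w+", word_with_punct))   # ['Hawai', 'i']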
Unicode offers a few characters intended for use in orthographies where symbols akin to quotation marks are used as letters. The characters 〈ʻ〉 U+02BB MODIFIER LETTER TURNED COMMA and 〈ʼ〉 U+02BC MODIFIER LETTER APOSTROPHE are for orthographies which require a 6-shaped or 9-shaped apostrophe as a letter. The comments for these two characters are instructive (Unicode 2009a:28):
02BB ʻ MODIFIER LETTER TURNED COMMA
• typographical alternate for 02BD ʽ or 02BF ʿ
• used in Hawaiʻian orthography as ʻokina (glottal stop)
→ 0312 ◌̒ combining turned comma above
14. Since Doulos SIL and its cousin Charis SIL are primarily used by linguists, the SIL font maintainers may design future versions which default to the slashed-zero form of U+2205 rather than the circle-slash form. At the time of writing the latest versions are 4.106, which default to the circle-slash.
15. The Cambria font is one of a collection of fonts whose names begin with “C” that are included in Windows Vista and other newer Microsoft products; they are also bundled with the free PowerPoint 2007 Viewer.
→ 07F5 ߵ nko low tone apostrophe
→ 2018 ‘ left single quotation mark
02BC ʼ MODIFIER LETTER APOSTROPHE
• glottal stop, glottalization, ejective
• many languages use this as a letter of their alphabets
• used as a tone marker in Bodo, Dogri, and Maithili
• 2019 ’ is the preferred character for a punctuation apostrophe
→ 0027 ' apostrophe
→ 0313 ◌̓ combining comma above
→ 0315 ◌̕ combining comma above right
→ 055A ՚ armenian apostrophe
→ 07F4 ߴ nko high tone apostrophe
→ 1FBF ᾿ greek psili
→ 2019 ’ right single quotation mark
Here the standard is indicating explicitly that in Hawaiian, and thus in other similar orthographies, the modifier letter U+02BB should be used rather than the quotation mark U+2018. The standard also states that the modifier letter U+02BC is the intended character for glottal stops, glottalization, and ejectives, rather than the quotation mark U+2019.
There are a number of orthographic communities which prefer a straight “typewriter-style” apostrophe rather than a curved 6-shaped or 9-shaped apostrophe. In such instances it is tempting to use 〈'〉 U+0027 APOSTROPHE for this letter, but again the abuse of punctuation is problematic. In version 5.1 Unicode added two new letters to its repertoire for this exact purpose: 〈Ꞌ〉 U+A78B LATIN CAPITAL LETTER SALTILLO and 〈ꞌ〉 U+A78C LATIN SMALL LETTER SALTILLO. These are intended to have an upright vertical appearance similar to that of the typewriter apostrophe, but they are actual letters rather than punctuation.
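The letter status of these characters, in contrast with the various punctuation marks they resemble, can be confirmed from their general categories, as in this brief Python sketch of my own:

import unicodedata

# Only the modifier letter and the saltillos are letters (Lm, Lu, Ll);
# the apostrophe and quotation mark are punctuation (Po, Pf).
for ch in ("\u0027", "\u2019", "\u02BC", "\uA78B", "\uA78C"):
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")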
⟦FIXME: Suggestions on how to orthographically implement such characters for general computer use. Keyboard maps and the programs for them. Placement on the keyboard. Alternate access to 〈'〉 U+0027 APOSTROPHE.⟧
7.2. MISSING CHARACTERS
In nearly all cases linguists will find that any character they might want is already standardized or can be constructed from an existing base plus combining diacritics. However, there are occasional lacunae where Unicode does not have an existing character and one cannot be constructed from existing characters.
If Unicode is missing an entire script then the first thing that a concerned linguist should do is check the Unicode Roadmap16 and the quaintly named Not The Roadmap17 to see whether or not a proposal exists for the script already. The Not The Roadmap lists a few scripts about which so little is known by the Unicode Consortium that no pre-allocation of blocks can be made for them. Another source of more detailed information on proposed scripts is the Proposed New Scripts18 registry, which lists many different proposals falling somewhere between “exploratory” and “pending publication”. If there is no evidence of the particular script anywhere in the future plans of the Unicode Consortium then there are two organizations which should be contacted regarding standardization of a new writing system. The Script Encoding Initiative19 is a project of the Department of Linguistics at the University of California, Berkeley which is devoted to the preparation of formal Unicode proposals for new scripts and characters. Michael Everson of Evertype20 is one of the editors of the published Unicode standard, and is well known in the Unicode world for his work in proposing new scripts for Unicode.21
16. http://unicode.org/roadmaps/
17. http://unicode.org/roadmaps/not-the-roadmap
18. http://unicode.org/pending/pending.html
19. http://www.linguistics.berkeley.edu/sei/index.html
20. http://www.evertype.com/
21. In addition Everson serves as one of two coordinators for the ConScript Unicode Registry of constructed and/or artificial scripts such as J.R.R. Tolkien’s Tengwar (“Elvish”) and the Klingon pIqaD used in Star Trek.
Missing characters should be handled in a similar manner. The latest version of the standard should be carefully reviewed to ensure that no existing character could serve adequately given an appropriate realization in a new font. It is often the case that people feeling the need for new characters have neglected trying to construct them from a base character and combining diacritics. The Pipeline22 is a frequently updated registry of characters and scripts in various stages of review before acceptance into a new version of the standard; often, if there is a felt need for a new character, another community may already have pushed it through part of the standards process so that it will appear in the Pipeline. If the character in question does not appear in the Pipeline then the archive of Notices of Non-Approval23 should be reviewed, since the character may have been previously decided against. Appearance in that archive should not be taken as a sign of absolutely final rejection, since there are several examples such as 〈ẞ〉 U+1E9E LATIN CAPITAL LETTER SHARP S which were once rejected but later revisited and approved. Unicode generally does not accept proposals for new characters which can otherwise be constructed from existing characters, an issue that should be kept in mind when proposing new characters for standardization. The process of proposing a new character is detailed on the Unicode website.24
22. http://www.unicode.org/alloc/Pipeline.html
23. http://www.unicode.org/alloc/nonapprovals.html
24. http://www.unicode.org/pending/proposals.html
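The base-plus-diacritics construction mentioned above deserves a concrete example. In the Python sketch below, illustrative only, a base letter combined with U+030C COMBINING CARON composes to a precomposed character when one exists, and simply remains a valid combining sequence when none does:

import unicodedata

# ⟨s⟩ plus combining caron composes to the precomposed š (U+0161)...
print(unicodedata.normalize("NFC", "s\u030C"))  # š
# ...but ⟨x⟩ plus combining caron stays a two-code-point sequence, since no
# precomposed x-caron exists; it is nevertheless a perfectly valid character.
print(unicodedata.normalize("NFC", "x\u030C"))  # x̌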
8. FURTHER READING
The Unicode standard is a surprisingly readable document when compared to most computing standards. It is available for free as a download from the Unicode Consortium’s website, either as a number of small PDFs of individual chapters and code charts, or as a single monolithic PDF in a compressed ZIP file. Certain versions of the standard are also made available as a printed book for a not unreasonable price considering the size and scope of the publication.
The Wikipedia articles on text encoding are, like most Wikipedia articles on computing topics, quite thorough and well researched, with extensive references to standards and both official and unofficial descriptions. The “Character sets” category includes articles on specific character sets and character encodings, whereas the “Character encoding” category mostly consists of articles on more general topics such as mojibake, whitespace, and collation. As with any references to Wikipedia, these categories are likely to change in the future. Nevertheless, a search for “Unicode” will give the main article from which many other articles can be explored.
For those interested in character sets and encoding issues in East Asian languages, the book CJKV Information Processing by Ken Lunde (1999) of Adobe is unequaled. It has recently been republished in a second edition which includes extensive discussion of improvements to Unicode, the use of OpenType, and the implementation of East Asian text in PDF, HTML, XML, and CSS. This book also provides a solid background in digital text even for those with no particular interest in East Asian writing systems.
9. BIBLIOGRAPHY
Boas, Franz. 1917. Grammatical notes on the language of the Tlingit Indians. (University of Pennsylvania University Museum Anthropological Publications 8.1). Philadelphia: University Museum.
Lunde, Ken. 1999. CJKV Information Processing. Sebastopol, CA: O’Reilly & Associates. ISBN 1-56592-224-7.
Unicode Consortium. 2009a. The Unicode standard, version 5.2.0. Mountain View, CA: The Unicode Consortium. ISBN 978-1-936213-00-9. http://www.unicode.org/versions/Unicode5.2.0/
Unicode Consortium. 2009b. Unicode technical standard #10: Unicode Collation Algorithm. http://www.unicode.org/reports/tr10/
Unicode Consortium. 2010. Glossary of Unicode terms. http://www.unicode.org/glossary/