Lecture 1: Encoding Languages

LING 1330/2330: Introduction to Computational Linguistics
Na-Rae Han
Objectives
 Understand the fundamentals of how language is
encoded on a computer
 Text encoding systems
 ASCII
 ISO-8859
 Unicode
1/12/2017
How is language represented on a computer?
 Natural ("human") languages:
 Spoken form
 Written form
*Also: sign languages
 The language of computers:
The language of computers
 At the lowest level, computer language is binary:
Information on a computer is stored in bits
 A bit is either: ON (=1, =yes) or OFF (=0, =no)
 In effect, this language has an alphabet of just two characters: 0 and 1
 Next level up: byte
 A byte is made up of a sequence of 8 bits
 ex. 01001101 
 Historically, a byte was the number of bits used to
encode a single character of text in a computer
 The byte is the basic addressable unit in most computer architectures
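As a small illustration of bits and bytes, the sketch below reads the example byte from this slide as a base-2 number and converts it back. The variable names are chosen here for illustration only.

```python
# The example byte from the slide, written as 8 bits.
byte_str = "01001101"

# Interpret the bit string as a base-2 integer.
value = int(byte_str, 2)
print(value)                 # 77

# Going the other way: format the integer as 8 binary digits.
print(format(value, "08b"))  # 01001101

# Number of distinct values a single byte can hold.
print(2 ** 8)                # 256
```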
Encoding a written language
 How to represent a text with 0s and 1s?
 Hello world!
 01001000011001010110110001101100011011110010000001
1101110110111101110010011011000110010000100001
 Each character is mapped to a code point (= character code), i.e., a unique integer.
 H → 72 (decimal)
 e → 101 (decimal)
 Each code point is represented as a binary number, using a
fixed number of bits.
 8 bits == 1 byte in the example above
 H → 72 (decimal) → 01001000 (2^6 + 2^3 = 64 + 8 = 72)
 e → 101 (decimal) → 01100101 (2^6 + 2^5 + 2^2 + 2^0 = 64 + 32 + 4 + 1 = 101)
 One byte can represent 256 (= 2^8) different characters
 00000000 → 0 (decimal) through 11111111 → 255 (decimal)
ASCII encoding for English
 How many bits are needed to encode English?
 26 lowercase letters: a, b, c, d, e, …
 26 uppercase letters: A, B, C, D, E, …
 10 Arabic digits: 0, 1, 2, 3, 4, …
 Punctuation: . , : ; ? ! ' "
 Symbols: ( ) < > & % * $ + -
 We are already up to 80
 6 bits (2^6 = 64) is not enough; we will need at least 7 (2^7 = 128)
 ASCII (the American Standard Code for Information
Interchange) did just that
 Uses 7-bit code (= 128 characters) for storing English text
 Range 0 to 127
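The bit-count argument above can be checked with a few lines of arithmetic. The inventory counts are taken from the slide; the dictionary and variable names are illustrative.

```python
import math

# Character inventory as counted on the slide.
counts = {
    "lowercase": 26,
    "uppercase": 26,
    "digits": 10,
    "punctuation": 8,
    "symbols": 10,
}
total = sum(counts.values())
print(total)                           # 80

# Smallest number of bits n such that 2**n >= total.
bits_needed = math.ceil(math.log2(total))
print(bits_needed, 2 ** bits_needed)   # 7 128
```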
The ASCII chart
 https://en.wikipedia.org/wiki/ASCII
 http://web.alfredstate.edu/weimandn/miscellaneous/ascii/ASCII%20Conversion%20Chart.pdf
Decimal   Binary (7-bit)   Character
0         000 0000         (NULL)
…         …                …
35        010 0011         #
36        010 0100         $
…         …                …
48        011 0000         0
49        011 0001         1
50        011 0010         2
…         …                …
65        100 0001         A
66        100 0010         B
67        100 0011         C
…         …                …
97        110 0001         a
98        110 0010         b
99        110 0011         c
…         …                …
127       111 1111         (DEL)
ASCII
(the American Standard Code for Information
Interchange)
 The ASCII encoding scheme
 First published in 1963
 Uses 7-bit code (= 128 characters) for storing English text,
ranging from 0 to 127
 In an 8-bit (1 byte) representation, the highest bit is always 0
 Printable characters
 Upper and lower case roman alphabet
 Digits
 Punctuation marks, symbols, and space
 Includes 33 non-printing characters (codes 0 to 31, plus 127)
 Control characters: BELL, ACKNOWLEDGE, BACKSPACE, DELETE, etc.; originally for teletypewriters, many obsolete now
 WHITESPACE characters: TAB, LINE FEED, CARRIAGE RETURN
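To see the split between printable and non-printing characters, here is a rough sketch that partitions the 128 ASCII codes and prints a few chart rows. The partition assumes the usual convention that codes 0 to 31 and 127 (DEL) are control characters and that the space (32) counts as printable.

```python
# Codes 0-31 and 127 are control characters; 32-126 are printable
# (32 is the space character).
control = [cp for cp in range(128) if cp < 32 or cp == 127]
printable = [cp for cp in range(128) if 32 <= cp <= 126]
print(len(control), len(printable))     # 33 95

# A few rows of the ASCII chart: decimal, 7-bit binary, character.
for ch in "#$012ABCabc":
    print(ord(ch), format(ord(ch), "07b"), ch)

# Whitespace control characters: TAB, LINE FEED, CARRIAGE RETURN.
print(ord("\t"), ord("\n"), ord("\r"))  # 9 10 13
```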
Practice
 What is this English text?
 Note: each character is shown in its byte (= 8-bit) ASCII representation instead of 7-bit
 Spaces are provided for your convenience only!
01001000 01101001 00100001
 Answer:
Hi!
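The practice answer can be verified by decoding the bytes programmatically. A minimal sketch, with `from_bits` as a helper name made up for this example:

```python
def from_bits(bit_string):
    """Decode space-separated 8-bit binary codes into text."""
    return "".join(chr(int(b, 2)) for b in bit_string.split())

print(from_bits("01001000 01101001 00100001"))  # Hi!
```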
Extending ASCII: ISO-8859, etc.
 ASCII (=7 bit, 128 characters) was sufficient for encoding
English. But what about characters used in other
languages?
 Solution: Extend ASCII into 8-bit (=256 characters) and use
the additional 128 slots for non-English characters
 ISO-8859: has 15 different parts, numbered 1 through 16 (part 12 was abandoned)!
 ISO-8859-1 (aka Latin-1): French, German, Spanish, etc.
 ISO-8859-7: Greek alphabet
 ISO-8859-8: Hebrew alphabet
 JIS X 0208: Japanese characters
 Problem: overlapping character code space.
224 (decimal) means à in Latin-1 but א in ISO-8859-8!
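The overlap is easy to reproduce: the same byte decodes to different characters depending on which ISO-8859 part is assumed. The sketch below uses Python's built-in codec names `iso8859_1` and `iso8859_8`.

```python
raw = bytes([224])                  # the single byte with value 224 (0xE0)

latin1_char = raw.decode("iso8859_1")
hebrew_char = raw.decode("iso8859_8")

print(latin1_char)                  # à
print(hebrew_char)                  # א

# Same byte, two different Unicode code points.
print(hex(ord(latin1_char)), hex(ord(hebrew_char)))  # 0xe0 0x5d0
```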
The problem with multiple encoding
systems
Problem: Multiple coding systems map different characters
to the same character code
 Solution 1: Provide meta-information on coding system
 Ex. MIME (Multipurpose Internet Mail Extensions)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
 But what if your message contains characters from multiple
coding systems?
 Solution 2: Have a single universal code system for all writing systems → UNICODE
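For Solution 1, here is a rough sketch of how MIME meta-information gets attached to a message, using Python's standard `email` package. The message text is made up for illustration, and the header values shown are what the default policy is expected to produce for short ASCII text.

```python
from email.message import EmailMessage

msg = EmailMessage()
# Declare the character set explicitly when setting the body.
msg.set_content("Hello world!\n", charset="us-ascii")

print(msg["MIME-Version"])               # 1.0
print(msg["Content-Type"])
print(msg.get_content_charset())         # us-ascii
print(msg["Content-Transfer-Encoding"])  # 7bit
```

A receiving program reads the charset parameter to decide how to turn the message bytes back into characters.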
Unicode
 A character encoding standard developed by the Unicode
Consortium
 Provides a single representation for all the world's writing systems
 "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language." (http://www.unicode.org)
How big is Unicode?
 Version 9.0 (2016) has codes for 128,237 characters
 Full Unicode standard uses 32 bits (4 bytes): it can represent 2^32 = 4,294,967,296 characters!
 In reality, only 21 bits are needed
 Unicode has three encoding versions
 UTF-32 (32 bits/4 bytes): direct representation
 UTF-16 (16 bits/2 bytes): 2^16 = 65,536 possibilities
 UTF-8 (8 bits/1 byte): 2^8 = 256 possibilities
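The "only 21 bits" claim can be checked against the highest code point Unicode defines, U+10FFFF. A minimal sketch:

```python
import sys

# The highest code point in the Unicode code space.
max_cp = 0x10FFFF

# Python exposes the same limit as sys.maxunicode.
print(sys.maxunicode == max_cp)   # True

# Number of bits needed to write that value in binary.
print(max_cp.bit_length())        # 21
print(2 ** 21)                    # 2097152
```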
8-bit, 16-bit, 32-bit
 UTF-32 (32 bits/4 bytes): direct representation
 UTF-16 (16 bits/2 bytes): 2^16 = 65,536 possibilities
 UTF-8 (8 bits/1 byte): 2^8 = 256 possibilities
 Wait! But how do you represent all of 2^32 (= 4 billion) code points with only one byte (UTF-8: 2^8 = 256 slots)?
 You don't.
 In reality, only 21 bits (2^21, about 2 million code points) are ever utilized for 128K characters.
 UTF-8 and UTF-16 use a variable-width encoding.
 Why UTF-16 and UTF-8?
 They are more compact (more so for certain languages, e.g., English)
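To make "more compact" concrete, the sketch below compares how many bytes the same text takes in each encoding. The sample strings and the helper name `sizes` are chosen here for illustration; the `-be` codec variants are used so that no byte-order mark is added to the count.

```python
def sizes(text):
    """Return the encoded length in bytes under UTF-8, UTF-16, and UTF-32."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-be", "utf-32-be")}

for sample in ["Hello", "héllo", "日本語"]:
    print(sample, sizes(sample))

# English text: 1 byte per character in UTF-8, 2 in UTF-16, 4 in UTF-32.
print(sizes("Hello"))   # {'utf-8': 5, 'utf-16-be': 10, 'utf-32-be': 20}
```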
Variable-width encoding
 'H' as 1 byte (8 bits): 01001000
cf. 'H' as 2 bytes (16 bits): 0000000001001000
 UTF-8 as a variable-width encoding
 ASCII characters get encoded with just 1 byte
 ASCII was originally 7-bit, so the highest bit is always 0 in an 8-bit encoding
 All other characters are encoded with multiple bytes
 How to tell? The highest bit is used as a flag.
 Highest bit 0: single-byte character
 Highest bit 1: part of a multi-byte character
 Ex. "HÉii": 01001000 11000011 10001001 01101001 01101001
(the two bytes 11000011 10001001 together encode É)
 Advantage for English: 8-bit ASCII is already valid UTF-8!
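The highest-bit flag can be observed directly by encoding a short string and inspecting each byte. A minimal sketch, using the sample string "HÉii":

```python
text = "HÉii"
encoded = text.encode("utf-8")

# Show each byte as 8 binary digits.
print(" ".join(format(b, "08b") for b in encoded))
# 01001000 11000011 10001001 01101001 01101001

# Classify each byte by its highest bit.
for b in encoded:
    if b < 0x80:
        kind = "highest bit 0: single-byte (ASCII) character"
    else:
        kind = "highest bit 1: part of a multi-byte character"
    print(format(b, "08b"), kind)

# Pure ASCII text has identical bytes under ASCII and UTF-8.
print("Hi!".encode("ascii") == "Hi!".encode("utf-8"))  # True
```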
A look at the Unicode chart
 How to find your Unicode character:
 http://www.unicode.org/standard/where/
 http://www.unicode.org/charts/
 Basic Latin (ASCII)
 http://www.unicode.org/charts/PDF/U0000.pdf
Code point for M. But "004D"?
Another representation: hexadecimal
Hexadecimal (hex) = base-16
 Utilizes 16 characters: 0123456789ABCDEF
 Designed for human readability & easy byte conversion
 2^4 = 16: 1 hexadecimal digit is equivalent to 4 bits
 1 byte (= 8 bits) is encoded with just 2 hex chars!
Letter   Base-10 (decimal)   Base-2 (binary)        Base-16 (hex)
M        77                  0000 0000 0100 1101    004D
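The row for M can be reproduced in Python, which also shows why each hex digit stands for exactly 4 bits. A short sketch:

```python
value = ord("M")
print(value)                   # 77
print(format(value, "016b"))   # 0000000001001101
print(format(value, "04X"))    # 004D

# Each hex digit corresponds to exactly 4 bits.
for digit in "004D":
    print(digit, format(int(digit, 16), "04b"))

# Converting back from hex to decimal.
print(int("004D", 16))         # 77
```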
 Unicode characters are usually referenced by their hexadecimal code points
 Lower-numbered characters go by 4-digit hex codes (2 bytes), e.g. U+004D ("M"; U+ designates Unicode)
 Higher-numbered characters go by 5- or 6-digit hex codes, e.g. U+1D122 (http://www.unicode.org/charts/PDF/U1D100.pdf)
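The U+ notation can be generated from any character with a one-line format. In this sketch, `u_plus` is a helper name invented for the example; it pads to at least 4 hex digits and grows to 5 or 6 for higher code points.

```python
def u_plus(ch):
    """Format a character's code point in U+XXXX notation (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"

print(u_plus("M"))            # U+004D
print(u_plus("\u00c9"))       # U+00C9  (É)
print(u_plus(chr(0x1D122)))   # U+1D122

# Going the other way: from a hex code point to the character.
print(chr(0x004D))            # M
```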