Character representation

Character Representation
2/2/2009
Characters
COMP370
Introduction to Computer Architecture
Goals for Today
• Understand how character data is represented and displayed.
Bits are Bits
• A bunch of bits can represent many things: numbers, logical values, or characters.
/* C++ character used in different ways. */
char stuff;
stuff = 'A';                        // used as a character
stuff = stuff + 10;                 // used as a number
if (stuff) { /* do something */ }   // used as a logical value
COMP370 Intro to Computer Architecture
ASCII
• 7 bit ASCII includes printable and non‐printable characters.
00 nul  01 soh  02 stx  03 etx  04 eot  05 enq  06 ack  07 bel
08 bs   09 ht   0a nl   0b vt   0c np   0d cr   0e so   0f si
10 dle  11 dc1  12 dc2  13 dc3  14 dc4  15 nak  16 syn  17 etb
18 can  19 em   1a sub  1b esc  1c fs   1d gs   1e rs   1f us
20 sp   21 !    22 "    23 #    24 $    25 %    26 &    27 '
28 (    29 )    2a *    2b +    2c ,    2d -    2e .    2f /
30 0    31 1    32 2    33 3    34 4    35 5    36 6    37 7
38 8    39 9    3a :    3b ;    3c <    3d =    3e >    3f ?
40 @    41 A    42 B    43 C    44 D    45 E    46 F    47 G
48 H    49 I    4a J    4b K    4c L    4d M    4e N    4f O
50 P    51 Q    52 R    53 S    54 T    55 U    56 V    57 W
58 X    59 Y    5a Z    5b [    5c \    5d ]    5e ^    5f _
60 `    61 a    62 b    63 c    64 d    65 e    66 f    67 g
68 h    69 i    6a j    6b k    6c l    6d m    6e n    6f o
70 p    71 q    72 r    73 s    74 t    75 u    76 v    77 w
78 x    79 y    7a z    7b {    7c |    7d }    7e ~    7f del
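The "Bits are Bits" idea can be checked against the ASCII table above. The following is a minimal C++ sketch, not part of the original slides; the helper names asNumber, shifted, and asLogical are made up for illustration, and an ASCII-compatible character set is assumed. It treats the same char as a number, as a character, and as a logical value.

```cpp
// Hypothetical helpers for illustration; assumes an ASCII character set.

// View the character as a number: its ASCII code.
int asNumber(char c) { return static_cast<int>(c); }

// Do arithmetic on the character: move `offset` places along the table.
char shifted(char c, int offset) { return static_cast<char>(c + offset); }

// View the character as a logical value: any nonzero code counts as true.
bool asLogical(char c) { return c != 0; }
```

With these helpers, asNumber('A') is 0x41, and shifted('A', 10) lands on 0x4b, which the table lists as 'K'. Only the nul character (code 00) is treated as false.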
Control Characters
• The first 32 ASCII values are control characters
• They were originally intended not to carry printable information, but rather to control devices and communication.
Name  Hex  Purpose
LF    0A   Line feed, move paper down one line
BEL   07   Ring the bell
CR    0D   Carriage return, move to beginning of line
SOH   01   Start Of Header for sending packets
DEL   7F   Delete previous character
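As a rough illustration, the control characters in the table can be written in C++ with escape sequences. This sketch assumes an ASCII-compatible character set and the default "C" locale; the constant names and the isControl helper are chosen here for illustration and do not come from the slides.

```cpp
#include <cctype>

// Hex codes from the table, written as C++ character constants.
// Assumes an ASCII-compatible execution character set.
const char LF  = '\n';    // 0x0A, line feed
const char BEL = '\a';    // 0x07, ring the bell
const char CR  = '\r';    // 0x0D, carriage return
const char SOH = '\x01';  // 0x01, start of header
const char DEL = '\x7f';  // 0x7F, delete

// Hypothetical helper: true for control characters. In the default "C"
// locale these are codes 0x00-0x1F plus 0x7F.
bool isControl(char c) {
    return std::iscntrl(static_cast<unsigned char>(c)) != 0;
}
```

The cast to unsigned char is there because std::iscntrl expects a value representable as unsigned char (or EOF).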
Modern Inconsistencies
• The original standard for control characters was somewhat ambiguous.
• Different companies had different interpretations that persist to today.
• What is a newline character, '\n'?
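One well-known case of this inconsistency: Unix-like systems end a line with LF, Windows uses CR followed by LF, and classic Mac OS used a lone CR. The sketch below shows one way a program might cope, by converting all three conventions to a single '\n'. The function name normalizeNewlines is hypothetical, not a standard library call.

```cpp
#include <string>

// Hypothetical helper: convert CR LF pairs and lone CR characters to one LF.
std::string normalizeNewlines(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        if (text[i] == '\r') {
            out += '\n';
            // Skip the LF of a CR LF pair so it is not emitted twice.
            if (i + 1 < text.size() && text[i + 1] == '\n') {
                ++i;
            }
        } else {
            out += text[i];
        }
    }
    return out;
}
```

Text that already uses LF alone passes through unchanged.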
Ancient Teletype Terminal
• An old teletype needed control for printing and paper tape.
• del punched holes over the previous character
• nul was used to give the printer time to physically move
Parity
• Early computer systems used only 7 bits for ASCII characters instead of the full byte.
• The Most Significant Bit was used for error detection in communications.
• The parity bit was set to the XOR of the data bits.
• If the calculated parity bit was not the same as the received parity bit, an error had occurred.
7 bit ASCII   Parity bit   ASCII w/ parity
1010110       0            01010110
0001110       1            10001110
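The parity scheme can be sketched in a few lines of C++. This is an illustrative sketch rather than code from the course; the function names parityBit, addParity, and parityMatches are assumptions. The parity bit is the XOR of the seven data bits and is stored in the most significant bit.

```cpp
#include <cstdint>

// Hypothetical helper: XOR of the low 7 data bits
// (1 when an odd number of them are set, otherwise 0).
std::uint8_t parityBit(std::uint8_t data) {
    std::uint8_t parity = 0;
    for (int i = 0; i < 7; ++i) {
        parity ^= (data >> i) & 1u;
    }
    return parity;
}

// Sender: place the parity bit in the most significant bit.
std::uint8_t addParity(std::uint8_t sevenBits) {
    return static_cast<std::uint8_t>((parityBit(sevenBits) << 7) | (sevenBits & 0x7F));
}

// Receiver: recompute parity from the data bits and compare it
// with the parity bit that was received.
bool parityMatches(std::uint8_t received) {
    return parityBit(received) == (received >> 7);
}
```

For the two rows in the table, addParity(0x56) (binary 1010110) gives 0x56 (01010110) and addParity(0x0E) (binary 0001110) gives 0x8E (10001110). A single flipped bit makes parityMatches return false; two flipped bits cancel out and go undetected.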
Unicode
• The ASCII character set has only 128 characters, or 256 if all 8 bits are used.
• ASCII does not provide characters for other national languages, such as Japanese or Arabic.
• The Unicode character set was originally designed with 16 bit characters. The current standard defines code points up to U+10FFFF.
• The first 128 Unicode characters are the same as ASCII.
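A small C++ sketch can make the relationship between ASCII and Unicode concrete. It uses char32_t literals (available since C++11) to hold code points; the variable names and the isAscii helper are chosen here for illustration only.

```cpp
// Code points stored as 32-bit values (char32_t literals need C++11).
const char32_t latinA     = U'A';       // U+0041, same value as ASCII 'A'
const char32_t hiraganaA  = U'\u3042';  // U+3042, Japanese hiragana letter A
const char32_t arabicAlef = U'\u0627';  // U+0627, Arabic letter alef

// Hypothetical helper: true when a code point lies in the range
// that Unicode shares with 7 bit ASCII.
bool isAscii(char32_t codePoint) { return codePoint < 128; }
```

Here latinA is 65, the same number ASCII assigns to 'A', while the Japanese and Arabic letters have values well above 127 and so cannot be stored in a 7 bit ASCII character.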
Characters and Fonts
Other Character Sets
• IBM computers used to use the EBCDIC character set. It uses 8 bits per character.
• Early Univac computers used the Fieldata character set. It used only 6 bits per character.
Fonts
• The bit pattern or number value of a character defines what the character should be. For example, 65 is an ‘A’.
• Fonts define how a given character should be displayed on the screen. The character ‘A’ can be displayed as A, A, A, A, or T (each rendered in a different font).
• Changing the font changes how the character appears on the screen but does not change the value or meaning of the character.
Pre‐Computer Printing
Flipped and Close Up
Origin of the word Font
• The term font, a cognate of the word fondue, derives from Middle French fonte, meaning "(something that has been) melt(ed)", referring to type produced by casting molten metal at a type foundry.
Font Terms
Points
• Fonts are traditionally measured in points.
• A point is 1/72 of an inch.
• Font size measures an invisible box which is typically a bit larger than the distance from the tallest ascender to the lowest descender.
• 12 point font is 1/6 of an inch.
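The point arithmetic can be written as a short C++ sketch. The names kPointsPerInch, pointsToInches, and inchesToPoints are illustrative assumptions, not part of any standard library.

```cpp
// A point is 1/72 of an inch.
const double kPointsPerInch = 72.0;

// Hypothetical helpers converting between points and inches.
double pointsToInches(double points) { return points / kPointsPerInch; }
double inchesToPoints(double inches) { return inches * kPointsPerInch; }
```

For example, pointsToInches(12.0) is 12/72, or 1/6 of an inch, and inchesToPoints(1.0) is 72 points.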
Font Styles
• Most fonts come in four different styles
–Regular
–Italic
–Bold
–Bold Italic
• Different styles do not change the point size.
Font Sizes
• Poster (extremely large sizes, usually larger than 72 point)
• Display (large sizes, typically 19‐72 point)
• Subhead (large text, typically about 14‐18 point)
• Regular (typically about 10‐13 point)
• Small text (typically about 8‐10 point)
• Caption (very small, typically about 6‐8 point)
Serifs
• Serifs comprise the small features at the end of strokes within letters.
• Fonts without serifs are sans serif.
Proportional Fonts
• Not every letter is written as the same width.
• With proportional fonts, some characters, like “W”, are wider than others, like “i”.
• Monospaced fonts function better for some purposes because they line up in neat, regular columns.
Strings
• In C++ and Java, a char variable can only hold one character.
• Strings are variables that can hold many characters.
• In C++ and Java, a character constant is surrounded by single quotes, 'A', while strings are surrounded by double quotes, "string".
Many Strings
• There are many ways to represent strings.
• The C language did not include a string type. Strings were stored in arrays of char.
• C strings are terminated by a null character, the character whose numerical value is zero.
• The char array must be one longer than the maximum string size to hold the null terminator character.
Varying Length Strings
• C++ has several string classes.
• Java and C++ string classes allow strings to be as long as necessary (with a very large maximum).
• There is no visible terminating character in Java Strings.
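The difference between the two representations can be sketched in C++. This is an illustrative example, not code from the course; the names greeting, cStringLength, and repeated are made up here. The first half shows a null-terminated char array, the second a std::string that grows as needed.

```cpp
#include <cstddef>
#include <string>

// A C string holding up to 5 characters needs an array of 6: the extra
// element stores the null terminator (numerical value zero).
char greeting[6] = "hello";

// Hypothetical helper: length of a C string, found by counting
// characters until the terminator is reached.
std::size_t cStringLength(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') {
        ++n;
    }
    return n;
}

// A std::string tracks its own length and grows as needed, so the
// caller never reserves a slot for a terminator.
std::string repeated(const std::string& piece, int times) {
    std::string result;
    for (int i = 0; i < times; ++i) {
        result += piece;
    }
    return result;
}
```

Here sizeof(greeting) is 6 while cStringLength(greeting) is 5, because the terminator occupies the last array element but is not counted as part of the string. Calling repeated("ab", 3) builds "ababab" without any fixed-size array.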