Information Theory and Digital Representations

LIS 386.13: Information Technologies and the Information Professions
R. E. Wyllys
Copyright © 2002 by R. E. Wyllys
Last revised 2002 Sep 7
School of Information - The University of Texas at Austin
Lesson Objectives
• You will
– Understand how the content of something that
contains information (viz., an Information-Bearing
Entity, or InBE) can be assessed in terms of the cost
of communicating and storing that content
– Understand how the "value" of an InBE's content
differs from its communication cost and can vary
from one user to another
– Become familiar with standard binary
representations of texts and of digitized images
Data vs. Information et alia
• A classic set of distinctions among "data,"
"information," "knowledge," and "wisdom" is:
– Data = raw facts
– Information = raw facts processed to make them more
usable and/or understandable by humans
– Knowledge = "the fact or condition of knowing something
with familiarity gained through experience and association"*;
i.e., information as stored in a mind
– Wisdom = "1.a. accumulated philosophic or scientific
learning; b. ability to discern inner qualities and relationships;
c. good sense"*; i.e., knowledge plus good judgment
*From: Merriam-Webster Collegiate Dictionary, 10th ed.
Assessing Information
• How can we talk usefully about the amount,
and about the value, of the information
available from some set of data?
– Are there different amounts of information in
statements such as
• "two plus two equals four"
• "e = mc2"
• "The distance from Austin to San Antonio is about 78
miles."
– Intuitively, we feel that such statements differ in
the amounts of information they contain.
– The statements also differ in what we are likely to
feel are the values of the information they contain.
Valuing Information
• That different InBEs contain information with different
values goes to the heart of what our Library and
Information Science profession deals with, as
suggested by such questions as:
– What information should be preserved?
– What pieces of information will be most useful to a particular
user? to different sets of users?
– How can the value of a particular set of information differ
over time? among different users?
• Assessing the value of information not only is
extremely difficult but also is usually subjective and
often transitory
• In this lesson, our primary concern is with the
question of how to measure the amount of
information in an InBE, rather than its value
Measuring Information
• We measure the amount of
information in an InBE in terms of
the cost of communicating
and/or storing the information in
digital form
• This approach is due to Dr. Claude
E. Shannon (1916 - 2001).
– For a sketch of Shannon's life and
work, see "Claude Shannon, Father of
Information Theory..."
Measuring Information
• Shannon developed his ideas while working in
the field of cryptology during World War II. He
published them in 1948 in a paper, "A Mathematical Theory
of Communication" (when computers were just beginning to
attract notice from the public).
– As the title above indicates, Shannon thought of his
theory as dealing with communication. It is known in
most of the world as communication theory, but in
the U.S. it has become known as information theory.
– The latter name has unfortunately led some people to
conclude, mistakenly, that Shannon's theory somehow
deals with the value of information. It does not. It
deals only with the communicating and the storing of
information.
Measuring Information
• The essence of Shannon's theory is that one
can assess the amount of information in a
message in terms of the amount of
uncertainty that is removed by the arrival of
the message.
• Intuitively, it seems reasonable to say that the
greater the degree of uncertainty that is
removed (i.e., the greater the enlightenment
conveyed) by the message, the greater must
be the amount of information in the message.
Measuring Information
• Shannon puts it this way: Consider a Sender,
S, and a Recipient, R, of a message. How
much uncertainty in R's mind can be resolved
when she receives a message from S?
S → message → R
Measuring Information
• The answer to the question, "How much
uncertainty in R's mind can be resolved when she
receives a message from S?", depends on the
situation. Consider the following:
– In 1775 Paul Revere arranged in advance to provide a
warning if a British attack became imminent. From the
steeple of Old North Church in Boston he would hang one
lantern if the British were coming by land, and two lanterns if
they were coming by sea. There were just two possibilities
to be resolved by Revere's message.
– But in today's world, an attack could come by land, or sea, or
from under the sea (via submarine), or from the air (via
airplane), or from space (via missile). Thus today a message
analogous to Revere's would have to resolve which one of
five possibilities was the correct one.
Measuring Information
• How can we compare the amounts of information
in the messages in these different situations?
– In 1775 Paul Revere's warning message about which
of 2 possible attacks was imminent contained some
amount of information, which we can call R for short.
– The analogous message in today's world, indicating
which of 5 possible attacks was under way, would
contain some other amount of information, which we
can call T for short.
• It should seem reasonable to you that amount T
must be bigger than amount R.
– Why? Because the T message clears up a more complex
situation than the R message.
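Shannon's theory makes this comparison precise: a message that singles out one of N equally likely possibilities conveys log₂ N units of information (in the unit named the "bit" a few slides later). The logarithmic formula and the "equally likely" assumption come from Shannon's paper rather than from these slides; a minimal Python sketch of the comparison:

    import math

    def info_units(n_possibilities):
        """Information, in Shannon's log-2 units, in a message that
        resolves one of n equally likely possibilities."""
        return math.log2(n_possibilities)

    R = info_units(2)   # Revere's 1775 message: by land or by sea
    T = info_units(5)   # today's message: land, sea, under sea, air, or space

    print(f"R = {R:.2f}")   # 1.00 unit
    print(f"T = {T:.2f}")   # about 2.32 units, so T > R, as argued above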
Measuring Information
• Shannon asks us to accept, as a premise, that it is
reasonable to assign to the simplest conceivable
situation, that of just two possibilities, the value of
one unit of information. Thus a message that
revealed which of the two possibilities was correct
would, by definition, contain one unit of information.
• Why is this the simplest conceivable situation?
– Because unless there are at least two possibilities, the
situation contains no uncertainty.
– And because if there are three possibilities, then the
situation clearly possesses a higher degree of uncertainty
than the situation with only two possibilities.
– And because a situation with four possibilities possesses
a still higher degree of uncertainty than that of a situation
with only three possibilities. And so on.
Measuring Information
• A colleague of Shannon's, the statistician Dr.
John W. Tukey (1915-2000), is credited with
coining the name bit for this minimum
conceivable amount of information, the amount
in a message that resolves a situation
possessing the lowest possible degree of
uncertainty.
– Tukey based the name on the fact that any such
message (i.e., a message resolving a situation with
just two possible outcomes) can be conveyed by a
vocabulary of just two words, or two symbols: for
example, 0 and 1.
– These symbols are the two digits in what is known as
binary arithmetic; and bit is an easy contraction of
"binary digit"
A Cautionary Note
• It is important here to observe, again, that Shannon's
theory deals only with the costs of communicating
information, not with the value of information, nor with
other aspects of communicating information.
• In Recent Contributions to the Mathematical Theory of
Communication, an excellent commentary on
Shannon's theory, Warren Weaver (1894 - 1978) wrote
that there can be problems at three levels of
communication
– "Level A: How accurately can the symbols of communication be
transmitted? (The technical problem.)
– "Level B: How precisely do the transmitted symbols convey the
desired meaning? (The semantic problem.)
– "Level C: How effectively does the received meaning affect
conduct in the desired way? (The effectiveness problem.)"
A Cautionary Note
• Weaver continues:
– "The technical problems are concerned with the
accuracy of transference from sender to receiver of
sets of symbols (written speech), or of a continuously
varying signal (telephonic or radio transmission . . .), or
of a continuously varying two-dimensional pattern
(television), etc.
– "The semantic problems are concerned with the
identity, or satisfactorily close approximation, in the
interpretation of meaning by the receiver, as compared
with the intended meaning of the sender. This is a very
deep and involved situation. . . .
A Cautionary Note
– "One essential complication is illustrated by the remark
that if Mr. X is suspected not to understand what Mr. Y
says, then it is theoretically not possible, by having Mr. Y
do nothing but talk further with Mr. X, completely to clarify
this situation in any finite time. If Mr. Y says 'Do you now
understand me?' and Mr. X says 'Certainly, I do', this is
not necessarily a certification that understanding has
been achieved. It may be just that Mr. X did not
understand the question.
– "The effectiveness problems are concerned with the
success with which the meaning conveyed to the receiver
leads to the desired conduct on his part. It may seem at
first glance undesirably narrow to imply that the purpose
of all communication is to influence the conduct of the
receiver. But with any reasonably broad definition of
conduct, it is clear that communication either affects
conduct or is without any discernible and probable effect
at all.
A Cautionary Note
– "The mathematical theory of communication . . .
admittedly applies in the first instance only to . . . the
technical problem of accuracy of transference of . . .
signals from sender to receiver. But the theory has, I
think, a deep significance [which] comes from the facts
that Levels B and C . . . can make use only of those
signal accuracies which turn out to be possible when
analyzed at Level A. Thus any limitations discovered in
the theory at Level A necessarily apply to Levels B and C.
But a larger part of the significance comes from the fact
that the analysis at Level A discloses that this level
overlaps the other levels more than one could possibly
naively suspect. Thus the theory of Level A is, at least to
a significant degree, also a theory of Levels B and C."
A Cautionary Note
• What the present lesson deals with is merely
what Weaver calls the "technical problem."
• Note: The papers by Shannon and Weaver
discussed here have been published together as
a book:
– Shannon, C. E., & Weaver, W. (1949). The
Mathematical Theory of Communication. Urbana, IL:
University of Illinois Press. ISBN: 0-252-72548-4
It may be of interest to GSLIS students that Warren Weaver was a noted
collector of the works of the mathematician Charles Dodgson (who is better
known by his pseudonym, Lewis Carroll), and that Weaver's collection is
now in the Harry Ransom Humanities Research Collection at UT-Austin.
Communicating and Storing
Information
• Shannon's theory grew, in part, out of the
fact that a vocabulary of just two symbols
is extraordinarily well suited to the physical
realities of
– sending messages by electrical, electronic,
and optical means
– storing messages by magnetic, electronic,
optical, and similar means
Communicating and Storing
Information
• The essence of these realities is that 2-valued
states tend to be much more easily
distinguishable than 3-valued, 4-valued, etc.
states. For example:
– It is easier to detect whether an electrical voltage is
present or not present than to determine whether its
strength is 1 or 2 or 3 or 4 or 5, etc., volts.
– It is easier to determine whether a tiny area of
magnetized iron oxide has its North pole or its South
pole pointing up than to determine the exact strength of
the magnetization.
Communicating and Storing
Information
• Thus it turns out to be much more practical to
send messages by turning voltages on and off
(two values) than by trying to use several
different levels of voltage to represent several
different values.
• Similarly, it is much more practical to store
messages in the form of patterns of magnetized
dots that are detected as "North up" or "South
up" (two values) than to try to use several
different intensities of magnetization to represent
several different values.
Sending Messages Using
Binary Digits
• Given that it makes good sense to use
just two values, how can we contrive—
with just two values—to send more than
two different messages?
– Consider what happens if we allow pairs of values,
such as 00, 01, 10, and 11. Then 4 different
messages can be sent by using only two physical
states that are easily distinguishable.
– These 4 different messages can, of course, have
meanings that the Sender and the Recipient have
agreed upon in advance.
Sending Messages Using
Binary Digits
• In similar fashion, if we allow triples (also
called 3-tuples) of values, such as 000,
001, 010, 011, 100, 101, 110, and 111,
then 8 different messages can be sent,
still using just easily distinguishable
physical states.
• Again, the 8 different messages can have
meanings that Sender and Recipient
have agreed on.
Sending Messages Using
Binary Digits
Analogously, if we allow 4-tuples of values, such
as 0000, 0001, etc., then 16 different, agreed-on
messages can be sent.
0000  0001  0010  0011  0100  0101  0110  0111
1000  1001  1010  1011  1100  1101  1110  1111
Sending Messages Using
Binary Digits
Proceeding in the same fashion, if we allow 5-tuples
of values, such as 00000, 00001, etc., then 32
different, agreed-on messages can be sent.
00000  01000  10000  11000
00001  01001  10001  11001
00010  01010  10010  11010
00011  01011  10011  11011
00100  01100  10100  11100
00101  01101  10101  11101
00110  01110  10110  11110
00111  01111  10111  11111
Sending Messages Using
Binary Digits
• Clearly, we could continue in the same fashion, using
6-tuples, 7-tuples, etc., as far as we might wish.
• If we restrict ourselves to just two symbols, e.g., 0
and 1, the procedure can be summarized as follows:
– Each position in the n-tuple, i.e., string of n symbols, can
have either of the two symbols
– Hence, for pairs, we have 2×2 = 2² = 4 different strings
– For 3-tuples, we have 2×2×2 = 2³ = 8 different strings
– For 4-tuples, we have 2×2×2×2 = 2⁴ = 16 different strings
– For 5-tuples, we have 2×2×2×2×2 = 2⁵ = 32 different strings
– And so on. In general, the pattern tells us that for an n-tuple,
there are 2ⁿ different strings, i.e., 2ⁿ different messages that
can be sent.
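A short Python sketch (standard library only) that enumerates the n-tuples and confirms the 2ⁿ count:

    from itertools import product

    for n in range(1, 6):
        strings = [''.join(bits) for bits in product('01', repeat=n)]
        print(n, len(strings), 2**n)    # the count always equals 2**n

    # For n = 3, for example, the strings are:
    # 000 001 010 011 100 101 110 111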
Sending Messages Using
Binary Digits
Here is another look at the table of 5-tuples. This time, we note
that 2⁵ = 32 possibilities are enough to let us use them to denote
a very convenient set of agreed-on messages: viz., letters and
selected punctuation marks and special characters.
00000 = space   01000 = H   10000 = P   11000 = X
00001 = A       01001 = I   10001 = Q   11001 = Y
00010 = B       01010 = J   10010 = R   11010 = Z
00011 = C       01011 = K   10011 = S   11011 = .
00100 = D       01100 = L   10100 = T   11100 = ,
00101 = E       01101 = M   10101 = U   11101 = ?
00110 = F       01110 = N   10110 = V   11110 = $
00111 = G       01111 = O   10111 = W   11111 = %
Sending Messages Using
Binary Digits
• The preceding slide shows one way of setting up a
one-to-one correspondence between strings of 5
binary digits and letters of the English alphabet, plus
space and some punctuation marks and characters.
– Note: The correspondences in the preceding slide are very
close to those actually used by Émile Baudot (1845-1903) in
the 5-bit printing-telegraph code he devised in 1874, a
forerunner of later teletype codes.
• With these correspondences, we can use nothing but
binary digits to send any message that can be
spelled out in letters of the English alphabet.
– For many decades, telegrams used just this set of symbols,
even when the message included numbers. For example:
CAN YOU MEET ME AT RAILROAD STATION? MY TRAIN
ARRIVES EIGHT THIRTY A.M.
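As an illustration, the slide's 5-bit code (the slide's own table, not a standard telegraph code) can be written down and applied in a few lines of Python; the message below is the telegram example just given:

    # The table's code: space = 00000, A-Z = 00001 through 11010,
    # then . , ? $ % for the five remaining strings.
    symbols = ' ' + 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' + '.,?$%'
    code = {ch: format(i, '05b') for i, ch in enumerate(symbols)}

    message = 'CAN YOU MEET ME AT RAILROAD STATION?'
    encoded = ''.join(code[ch] for ch in message)

    print(code['C'], code['A'], code['N'])                     # 00011 00001 01110
    print(len(message), 'characters,', len(encoded), 'bits')   # 5 bits per character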
Sending Messages Using
Binary Digits
• With 6-tuples, there are 2⁶ = 64 different strings,
enough to provide one-to-one correspondences
between the strings and 26 upper-case letters, 26
lower-case letters, 10 numerals, space, and period "."
– Computers, like other electronic apparatus, are well suited to
using 2-valued signals internally, since much of a computer's
circuitry operates by being in either an "on" or an "off" state
at any given moment; actions are accomplished by changes
from one state to the other (at enormously high speeds).
– Many computers in the 1940s and 1950s used essentially a
set of 6-binary-digit strings (6-bit strings) to display and/or
print out messages for humans.
The ASCII Character Set
• By the 1960s, demands for something more
convenient for human needs had increased to
the point of stimulating a general shift from 6-bit
strings to 8-bit strings, since these provide for
2⁸ = 256 possible different combinations.
These 8-bit strings are called bytes.
– A more-or-less standard set of 256 correspondences
was developed by the U.S. government and
computing industry. It was named the American
Standard Code for Information Interchange, or
ASCII, set.
– Note: Sets of 4 bits are sometimes called half-bytes,
or quadbits, or nibbles.
The ASCII Character Set
• The ASCII character set was formally adopted as a U.S.
standard in 1968 by the then National Bureau of
Standards (now the National Institute of Standards and
Technology, NIST) and by the American National
Standards Institute (ANSI).
• The International Organization for Standardization
(ISO) later adopted the ASCII character set as an
international standard.
– This organization's name is abbreviated by "ISO", not "IOS" (as
you might expect if it were an acronym, which it is not).
– The reason is that "iso" comes from the Classical Greek
isos, meaning "equal." The organization chose this world-wide
abbreviation of its name in order to avoid the problems of using
different acronyms in different languages.
The ASCII Character Set
• The ASCII set is "more-or-less standard" in the
following sense:
– Most modern computers use the first 128 8-bit patterns (those
beginning with a 0, often called the "low-order" characters) in
the same way, to represent upper- and lower-case letters of the
English alphabet plus numerals, a variety of punctuation
symbols, and some invisible control codes (e.g., the carriage-return and end-of-file marks).
– The second 128 8-bit patterns (those beginning with a 1, called
the "high-order" characters) are used in slightly different ways
to represent accented characters in various Western European
languages; e.g., there are minor differences among the U.S.
English, French, German, and Spanish sets of high-order
characters.
Sample ASCII Characters
Symbol   Binary Representation   Decimal Equivalent   Hexadecimal Equivalent
space    00100000                32                   20
$        00100100                36                   24
(        00101000                40                   28
.        00101110                46                   2E
1        00110001                49                   31
2        00110010                50                   32
A        01000001                65                   41
B        01000010                66                   42
a        01100001                97                   61
b        01100010                98                   62
É        11001001                201                  C9
é        11101001                233                  E9
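These values are easy to verify in Python, which reports a character's numeric code directly (the two accented characters fall in the "high-order" range described on the preceding slide):

    for ch in [' ', '$', '(', '.', '1', '2', 'A', 'B', 'a', 'b', 'É', 'é']:
        n = ord(ch)                              # numeric code of the character
        print(repr(ch), format(n, '08b'), n, format(n, '02X'))

    # e.g.  'A'  01000001  65   41
    #       'é'  11101001  233  E9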
The Unicode Character Set
• The 256-symbol ASCII character set, and its minor
variants, handle languages with Latin alphabets quite
well. But in this world there are also
– Many languages that use non-Latin alphabets (e.g., Arabic,
Cherokee, Cyrillic, Hebrew, Hindi, Korean)
– Some languages that use ideographs (e.g., Chinese,
Japanese) rather than alphabets
• The existence of non-Latin-alphabet and ideographic
languages means that the total set of symbols that
people around the world want to use in computers and
telecommunications is vastly larger than 256.
• To answer the need for a standard set of symbols
capable of serving all written languages, the ISO and an
allied industry group, the Unicode Consortium, have
worked together to develop the Unicode Character Set.
The Unicode Character Set
• The Unicode characters consist of 16-bit strings, and are
often referred to as double-byte characters.
• With 16 bits, the total number of possible distinct messages
(i.e., symbols) is 2¹⁶ = 65,536, which is enough to
accommodate all written languages.
• The ASCII characters are represented in Unicode by 16-bit
strings in which the first 8 bits are all 0; non-ASCII strings in
Unicode have at least one 1 among their first 8 bits. Thus,
the ASCII character set is a subset of Unicode.
• As an example, the Chinese character 零 for the word "zero"
(as distinguished from the digit "0") is represented in
Unicode by 1001011011110110, which equals 96F6 in
hexadecimal notation and 38,646 in decimal notation.
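A quick check in Python, which numbers characters by their Unicode code points:

    ch = '零'                        # the Chinese character for "zero", U+96F6
    n = ord(ch)
    print(format(n, '016b'))         # 1001011011110110
    print(format(n, '04X'), n)       # 96F6 38646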
Prefixes Denoting Quantities
• Our next topic deals with the storage of large
quantities of information.
• The discussion will be aided by the use of
standard prefixes for denoting large and small
quantities.
– You are undoubtedly already familiar with some
of these prefixes:
• "kilo" denotes 1,000 of the basic units (e.g.,
kilometer, kilohertz)
• "micro" denotes 1/1,000,000, or one millionth, of
the basic unit (e.g., microsecond)
Prefixes Denoting Quantities
• The standard prefixes that have been defined
to date by the International Organization for
Standardization (ISO) range from 10²⁴, one
septillion of the basic units, to 10⁻²⁴, or one
septillionth of the basic unit.
– This may seem to you to be an enormous range,
and it is.
– But the ISO set of prefixes has already been
extended twice in recent decades in order to handle
the need for dealing with both ever larger and also
ever smaller quantities.
– This need arises from the seemingly inexorable
advances of science and technology. You can
expect new prefixes to be defined in the future.
Prefixes Denoting Quantities
Prefix   Symbol   Multiplier                              Exponential Form
yotta    Y        1,000,000,000,000,000,000,000,000       10²⁴
zetta    Z        1,000,000,000,000,000,000,000           10²¹
exa      E        1,000,000,000,000,000,000               10¹⁸
peta     P        1,000,000,000,000,000                   10¹⁵
tera     T        1,000,000,000,000                       10¹²
giga     G        1,000,000,000                           10⁹
mega     M        1,000,000                               10⁶
kilo     k        1,000                                   10³
hecto    h        100                                     10²
Prefixes Denoting Quantities
Prefix   Symbol   Multiplier                              Exponential Form
centi    c        0.01                                    10⁻²
milli    m        0.001                                   10⁻³
micro    μ        0.000 001                               10⁻⁶
nano     n        0.000 000 001                           10⁻⁹
pico     p        0.000 000 000 001                       10⁻¹²
femto    f        0.000 000 000 000 001                   10⁻¹⁵
atto     a        0.000 000 000 000 000 001               10⁻¹⁸
zepto    z        0.000 000 000 000 000 000 001           10⁻²¹
yocto    y        0.000 000 000 000 000 000 000 001       10⁻²⁴
Storing Quantities of Characters
• In English (and other alphabetic languages) letters
are combined to make up words; words make up
sentences; sentences, paragraphs; paragraphs,
chapters or letters or articles; and so on.
• What does all this imply for storage?
– One letter is equivalent to 1 byte, consisting of 8 bits.
– In English, the average length of a word is 5.8
letters, plus the necessary space (which really is the
27th letter of the English alphabet) that separates a
word from its successor. So 1 word = 6.8×8 = 54.4
bits = 6.8 bytes ≈ 7 bytes. (The symbol "≈" means
"approximately equal.")
– A typical journal article consists of about 3,500
words, which is equivalent to 3,500×7 = 24,500
bytes; or roughly 25,000 bytes.
Storing Quantities of Characters
• A typical book contains around 100,000 words;
or roughly 700,000 bytes.
• The Encyclopedia Britannica contains, in its 31
physical volumes, about 43 million words (or
about 1.4 million words per volume); in all,
roughly 300 million bytes, i.e., 300 megabytes
(300MB).
– Note: The CD version of this encyclopedia contains
approximately 650MB, so about 350MB of storage
must be devoted to non-text materials, i.e.,
illustrations. Storing illustrations involves techniques
for digitizing images.
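A back-of-the-envelope version of these estimates in Python (the word counts are the slide's round figures, not measurements):

    BYTES_PER_WORD = 7                          # 6.8 characters per word, rounded up

    article = 3_500 * BYTES_PER_WORD            # 24,500 bytes, roughly 25,000
    book = 100_000 * BYTES_PER_WORD             # 700,000 bytes
    britannica = 43_000_000 * BYTES_PER_WORD    # about 300 million bytes (300 MB)

    print(article, book, britannica)            # 24500 700000 301000000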
Storing Quantities of Characters
• Storing images electronically, i.e., in digitized
form, is more complicated than storing text,
since the resultant size in bytes of a file
containing a digitized image is affected by
– The resolution (dots per inch, DPI) of the scanning
– The degree of color accuracy desired (unless the
image uses only black and white)
• Color accuracy can range from 16 colors, requiring 4 color
bits per pixel, to 16,777,216 colors, requiring 24 color bits
per pixel
– The degree of compression used
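As a rough illustration (hypothetical numbers; this estimates the size before any compression is applied), the uncompressed size of a scan is width × height × DPI² × bits-per-pixel ÷ 8:

    def scan_size_bytes(width_in, height_in, dpi, bits_per_pixel):
        """Uncompressed size, in bytes, of a scanned image."""
        pixels = (width_in * dpi) * (height_in * dpi)
        return pixels * bits_per_pixel / 8

    # An 8"x10" original scanned at 300 DPI (an assumed, typical resolution):
    print(scan_size_bytes(8, 10, 300, 1))     # 1-bit line art: 900,000 bytes
    print(scan_size_bytes(8, 10, 300, 24))    # 24-bit color: 21,600,000 bytes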
Storing Quantities of Characters
• Some examples:
– The Chinese ideograph we saw earlier, a simple
black-and-white line drawing, is stored in a GIF file
totaling 289 bytes.
• GIF files use lossless (LZW) compression but limit an image to a
palette of 256 colors, so converting a full-color image to GIF sets a
compromise between image quality and file size, sacrificing some
color fidelity in order to reduce the file size.
• Note: JPEG files use lossy compression, which discards some image
detail outright in exchange for smaller files.
– I recently scanned an 8"x10" black-and-white
photograph into a medium-resolution TIFF file. The
file contained about 940,000 bytes (i.e., 0.94MB).
TIFF files use lossless compression.
– I also scanned an 8"x10" color photograph of the
same scene into a TIFF file. This file contained
about 2.8MB.
How Many Bytes are in the
UT-Austin General Libraries?
• Another example: The UT-Austin General
Libraries contain about 8 million volumes.
– If these volumes consisted solely of text, they would
total roughly 8,000,000×700,000 = 5.6 trillion bytes,
i.e., 5.6 terabytes (5.6TB).
– If, on the other hand, they contain about the same
mixture of text and illustrations as the Encyclopedia
Britannica, then they must total roughly 12.1TB of
information.
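The same estimate in Python, scaling up the text-only figure by the Britannica ratio of 650 MB total to 300 MB of text:

    volumes = 8_000_000
    text_bytes_per_volume = 700_000

    text_only = volumes * text_bytes_per_volume       # 5.6e12 bytes = 5.6 TB
    with_images = text_only * 650 / 300                # Britannica-like mix of text and images

    print(text_only / 1e12, 'TB text only')                        # 5.6 TB
    print(round(with_images / 1e12, 1), 'TB with illustrations')   # about 12.1 TB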
How Many Bytes are in the
UT-Austin General Libraries?
• Using the estimate of 12.1TB as the amount of information in
the UT-Austin General Libraries, we can note further that
– The Perry-Castañeda Library (PCL), one unit of the General Libraries,
contains about 1.5 million volumes, or about 1.06TB of information.
– PCL offers 349,313 lineal feet (about 66.2 miles or 106.5 kilometers) of
shelf space* for these volumes. The PCL shelves currently provide some
room for expansion of the collection.
– Estimating the currently occupied PCL shelf space as 50% of the total
space, or about 174,656ft (53.2km), we can calculate that, in the form of
typical printed books, 1 yard of filled shelf space contains an average of
about 18.2MB of information (or that 1 meter of filled shelf space contains
an average of about 19.9MB of information).
*For this datum I am indebted to George Cogswell of the General Libraries staff.
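A sketch of the shelf-space arithmetic in Python, using the slide's figures (the 50% occupancy is the slide's own estimate):

    pcl_bytes = 1.06e12                        # PCL's estimated 1.06 TB
    shelf_total_ft = 349_313
    occupied_ft = shelf_total_ft * 0.50        # about 174,656 lineal feet filled

    per_yard = pcl_bytes / (occupied_ft / 3)          # bytes per yard of filled shelf
    per_meter = pcl_bytes / (occupied_ft * 0.3048)    # bytes per meter of filled shelf

    print(round(per_yard / 1e6, 1), 'MB per yard')      # about 18.2 MB
    print(round(per_meter / 1e6, 1), 'MB per meter')    # about 19.9 MB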
How Many Bytes are in the
UT-Austin General Libraries?
• In comparison with these dimensions for book
storage
– Computer storage systems in the multi-terabyte
range are currently being manufactured and used,
with physical sizes comparable to those of a
compact car or smaller.
– An interesting statistic from the World-Wide Web
provides an additional perspective:
• "Web ad placement firm DoubleClick currently maintains
over 100 terabytes of storage. If printed out, that would
equal about 300 single-spaced sheets of paper for every
Web user." [Source: Tracking the web of data you weave.
2000 October 2: U.S. News & World Report; p. 66]
How Many Bytes are in the
UT-Austin General Libraries?
• Another comparison:
– Using CDs to store 12.1TB (at approximately
650MB per CD) would require about 18,700 CDs.
– Since DVDs have about 10 times the storage
capacity of CDs, storing the contents of the
General Libraries on DVDs would require about
1,870 DVDs.
How Many Bytes are in the
UT-Austin General Libraries?
• A standard DVD case measures 19cm high, 13.7cm wide, and
1.5cm thick.
– To store 1,870 DVDs in their cases would require 1870×1.5cm =
28.05m (about 92 lineal feet) of shelf space of suitable height.
Total volume (rounding the case to 20cm×14cm): 1870×20×14×1.5 =
785,400cm³ = 0.79m³ ≈ 27.7ft³.
• A standard CD case measures 12.6cm high, 14.3cm wide, and
1cm thick.
– To store 18,700 CDs in their cases would require 18700×1cm =
187m (about 614 lineal feet) of shelf space of suitable height.
Total volume: 18700×12.6×14.3×1 = 3,369,366cm³ ≈ 3.37m³ ≈ 119ft³.
• Note: These calculations are very rough estimates, but they
furnish a general idea of the reduction in physical size that can
result from storing information in binary form in computer-related
media rather than in the form of texts and images on paper.
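The disc arithmetic can be checked the same way (the 10-to-1 DVD-to-CD capacity ratio and the case dimensions are the slide's round figures):

    total_bytes = 12.1e12
    cd_capacity = 650e6
    dvd_capacity = 10 * cd_capacity             # the slide's rough 10x figure

    print(round(total_bytes / cd_capacity))     # about 18,600 CDs (rounded to 18,700 above)
    print(round(total_bytes / dvd_capacity))    # about 1,860 DVDs (rounded to 1,870 above)

    print(18_700 * 1.0 / 100, 'm of CD cases')     # 187.0 m of shelf space
    print(1_870 * 1.5 / 100, 'm of DVD cases')     # 28.05 m of shelf space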
How Many Bytes of Information Are
Produced Annually in the World
• In October 2000 Profs. Peter Lyman and Hal Varian of
the School of Information Management and Systems,
University of California, Berkeley, released a study of
the world's annual production of information.
– The study, sponsored by EMC Corp., is entitled How Much
Information?.
– The authors estimate that "the world's total yearly production
of print, film, optical, and magnetic content would require
roughly 1.5 billion gigabytes [i.e., 1.5 exabytes] of storage.
This is the equivalent of 250 megabytes per person for each
man, woman, and child on earth."
– A summary of the study is available from a press release, UC
Berkeley Professors Measure Exploding World Production Of
New Information.
Saving Space Can Be Important