LIS 386.13 Information Technologies and the Information Professions
Information Theory and Digital Representations
R. E. Wyllys
Copyright © 2002 by R. E. Wyllys. Last revised 2002 Sep 7
School of Information - The University of Texas at Austin

Lesson Objectives
• You will
– Understand how the content of something that contains information (viz., an Information-Bearing Entity, or InBE) can be assessed in terms of the cost of communicating and storing that content
– Understand how the "value" of an InBE's content differs from its communication cost and can vary from one user to another
– Become familiar with standard binary representations of texts and of digitized images

Data vs. Information et alia
• A classic set of distinctions among "data," "information," "knowledge," and "wisdom" is:
– Data = raw facts
– Information = raw facts processed to make them more usable and/or understandable by humans
– Knowledge = "the fact or condition of knowing something with familiarity gained through experience and association"*; i.e., information as stored in a mind
– Wisdom = "1.a. accumulated philosophic or scientific learning; b. ability to discern inner qualities and relationships; c. good sense"*; i.e., knowledge plus good judgment
*From: Merriam-Webster Collegiate Dictionary, 10th ed.

Assessing Information
• How can we talk usefully about the amount, and about the value, of the information available from some set of data?
– Are there different amounts of information in statements such as
• "Two plus two equals four."
• "E = mc²."
• "The distance from Austin to San Antonio is about 78 miles."
– Intuitively, we feel that such statements differ in the amounts of information they contain.
– The statements also differ in what we are likely to feel are the values of the information they contain.

Valuing Information
• That different InBEs contain information with different values goes to the heart of what our Library and Information Science profession deals with, as suggested by such questions as:
– What information should be preserved?
– What pieces of information will be most useful to a particular user? To different sets of users?
– How can the value of a particular set of information differ over time? Among different users?
• Assessing the value of information is not only extremely difficult but also usually subjective and often transitory.
• In this lesson, our primary concern is with the question of how to measure the amount of information in an InBE, rather than its value.

Measuring Information
• We measure the amount of information in an InBE in terms of the cost of communicating and/or storing the information in digital form.
• This approach is due to Dr. Claude E. Shannon (1916-2001).
– For a sketch of Shannon's life and work, see "Claude Shannon, Father of Information Theory..."

Measuring Information
• Shannon developed his ideas while working in the field of cryptology during World War II. He published them in 1948 in the paper "A Mathematical Theory of Communication" (when computers were just beginning to attract notice from the public).
– As the title indicates, Shannon thought of his theory as dealing with communication.
It is known in most of the world as communication theory, but in the U.S. it has become known as information theory.
– The latter name has unfortunately led some people to conclude, mistakenly, that Shannon's theory somehow deals with the value of information. It does not. It deals only with the communicating and the storing of information.

Measuring Information
• The essence of Shannon's theory is that one can assess the amount of information in a message in terms of the amount of uncertainty that is removed by the arrival of the message.
• Intuitively, it seems reasonable to say that the greater the degree of uncertainty that is removed (i.e., the greater the enlightenment conveyed) by the message, the greater must be the amount of information in the message.

Measuring Information
• Shannon puts it this way: Consider a Sender, S, and a Recipient, R, of a message. How much uncertainty in R's mind can be resolved when she receives a message from S?
[S → message → R]

Measuring Information
• The answer to the question, "How much uncertainty in R's mind can be resolved when she receives a message from S?", depends on the situation. Consider the following:
– In 1775 Paul Revere arranged in advance to provide a warning if a British attack became imminent. From the steeple of Old North Church in Boston he would hang one lantern if the British were coming by land, and two lanterns if they were coming by sea. There were just two possibilities to be resolved by Revere's message.
– But in today's world, an attack could come by land, or sea, or from under the sea (via submarine), or from the air (via airplane), or from space (via missile). Thus today a message analogous to Revere's would have to resolve which one of five possibilities was the correct one.

Measuring Information
• How can we compare the amounts of information in the messages in these different situations?
– In 1775 Paul Revere's warning message about which of 2 possible attacks was imminent contained some amount of information, which we can call R for short.
– The analogous message in today's world, indicating which of 5 possible attacks was under way, would contain some other amount of information, which we can call T for short.
• It should seem reasonable to you that amount T must be bigger than amount R.
– Why? Because the T message clears up a more complex situation than the R message.

Measuring Information
• Shannon asks us to accept, as a premise, that it is reasonable to assign to the simplest conceivable situation, that of just two possibilities, the value of one unit of information. Thus a message that revealed which of the two possibilities was correct would, by definition, contain one unit of information.
• Why is this the simplest conceivable situation?
– Because unless there are at least two possibilities, the situation contains no uncertainty.
– And because if there are three possibilities, then the situation clearly possesses a higher degree of uncertainty than the situation with only two possibilities.
– And because a situation with four possibilities possesses a still higher degree of uncertainty than a situation with only three possibilities. And so on.
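Though the slides stop short of stating the formula, Shannon's measure makes this premise quantitative: a message that singles out one of n equally likely possibilities carries log₂(n) units (bits) of information. A quick sketch in Python (the function name is mine, for illustration):

```python
import math

def bits_to_resolve(n_possibilities: int) -> float:
    """Bits of information in a message that singles out one of
    n equally likely possibilities (Shannon's log-base-2 measure)."""
    return math.log2(n_possibilities)

# Revere's two-way signal resolves exactly one unit of information:
print(bits_to_resolve(2))   # 1.0
# A five-way modern warning resolves more uncertainty:
print(bits_to_resolve(5))   # about 2.32 bits
```

This is the quantitative sense in which the five-possibility message T must contain more information than the two-possibility message R.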
Measuring Information
• A colleague of Shannon's, the statistician Dr. John W. Tukey (1915-2000), is credited with coining the name bit for this minimum conceivable amount of information: the amount in a message that resolves a situation possessing the lowest possible degree of uncertainty.
– Tukey based the name on the fact that any such message (i.e., a message resolving a situation with just two possible outcomes) can be conveyed by a vocabulary of just two words, or two symbols: for example, 0 and 1.
– These symbols are the two digits in what is known as binary arithmetic, and bit is an easy contraction of "binary digit."

A Cautionary Note
• It is important here to observe, again, that Shannon's theory deals only with the costs of communicating information, not with the value of information, nor with other aspects of communicating information.
• In Recent Contributions to the Mathematical Theory of Communication, an excellent commentary on Shannon's theory, Warren Weaver (1894-1978) wrote that there can be problems at three levels of communication:
– "Level A: How accurately can the symbols of communication be transmitted? (The technical problem.)
– "Level B: How precisely do the transmitted symbols convey the desired meaning? (The semantic problem.)
– "Level C: How effectively does the received meaning affect conduct in the desired way?
(The effectiveness problem.)"

A Cautionary Note
• Weaver continues:
– "The technical problems are concerned with the accuracy of transference from sender to receiver of sets of symbols (written speech), or of a continuously varying signal (telephonic or radio transmission . . .), or of a continuously varying two-dimensional pattern (television), etc.
– "The semantic problems are concerned with the identity, or satisfactorily close approximation, in the interpretation of meaning by the receiver, as compared with the intended meaning of the sender. This is a very deep and involved situation. . . .

A Cautionary Note
– "One essential complication is illustrated by the remark that if Mr. X is suspected not to understand what Mr. Y says, then it is theoretically not possible, by having Mr. Y do nothing but talk further with Mr. X, completely to clarify this situation in any finite time. If Mr. Y says 'Do you now understand me?' and Mr. X says 'Certainly, I do', this is not necessarily a certification that understanding has been achieved. It may be just that Mr. X did not understand the question.
– "The effectiveness problems are concerned with the success with which the meaning conveyed to the receiver leads to the desired conduct on his part. It may seem at first glance undesirably narrow to imply that the purpose of all communication is to influence the conduct of the receiver. But with any reasonably broad definition of conduct, it is clear that communication either affects conduct or is without any discernible and probable effect at all.
A Cautionary Note
– "The mathematical theory of communication . . . admittedly applies in the first instance only to . . . the technical problem of accuracy of transference of . . . signals from sender to receiver. But the theory has, I think, a deep significance [which] comes from the facts that Levels B and C . . . can make use only of those signal accuracies which turn out to be possible when analyzed at Level A. Thus any limitations discovered in the theory at Level A necessarily apply to Levels B and C. But a larger part of the significance comes from the fact that the analysis at Level A discloses that this level overlaps the other levels more than one could possibly naively suspect. Thus the theory of Level A is, at least to a significant degree, also a theory of Levels B and C."

A Cautionary Note
• What the present lesson deals with is merely what Weaver calls the "technical problem."
• Note: The papers by Shannon and Weaver discussed here have been published together as a book:
– Shannon, C. E., & Weaver, W. (1949). The Mathematical Theory of Communication. Urbana, IL: University of Illinois Press. ISBN: 0-252-72548-4
It may be of interest to GSLIS students that Warren Weaver was a noted collector of the works of the mathematician Charles Dodgson (better known by his pseudonym, Lewis Carroll), and that Weaver's collection is now in the Harry Ransom Humanities Research Center at UT-Austin.
Communicating and Storing Information
• Shannon's theory grew, in part, out of the fact that a vocabulary of just two symbols is extraordinarily well suited to the physical realities of
– sending messages by electrical, electronic, and optical means
– storing messages by magnetic, electronic, optical, and similar means

Communicating and Storing Information
• The essence of these realities is that 2-valued states tend to be much more easily distinguishable than 3-valued, 4-valued, etc., states. For example:
– It is easier to detect whether an electrical voltage is present or absent than to determine whether its strength is 1 or 2 or 3 or 4 or 5, etc., volts.
– It is easier to determine whether a tiny area of magnetized iron oxide has its North pole or its South pole pointing up than to determine the exact strength of the magnetization.

Communicating and Storing Information
• Thus it turns out to be much more practical to send messages by turning voltages on and off (two values) than by trying to use several different levels of voltage to represent several different values.
• Similarly, it is much more practical to store messages in the form of patterns of magnetized dots that are detected as "North up" or "South up" (two values) than to try to use several different intensities of magnetization to represent several different values.
Sending Messages Using Binary Digits
• Given that it makes good sense to use just two values, how can we contrive—with just two values—to send more than two different messages?
– Consider what happens if we allow pairs of values, such as 00, 01, 10, and 11. Then 4 different messages can be sent by using only two physical states that are easily distinguishable.
– These 4 different messages can, of course, have meanings that the Sender and the Recipient have agreed upon in advance.

Sending Messages Using Binary Digits
• In similar fashion, if we allow triples (also called 3-tuples) of values, such as 000, 001, 010, 011, 100, 101, 110, and 111, then 8 different messages can be sent, still using just easily distinguishable physical states.
• Again, the 8 different messages can have meanings that Sender and Recipient have agreed on.

Sending Messages Using Binary Digits
Analogously, if we allow 4-tuples of values, such as 0000, 0001, etc., then 16 different, agreed-on messages can be sent.

0000   1000
0001   1001
0010   1010
0011   1011
0100   1100
0101   1101
0110   1110
0111   1111

Sending Messages Using Binary Digits
Proceeding in the same fashion, if we allow 5-tuples of values, such as 00000, 00001, etc., then 32 different, agreed-on messages can be sent.
00000   01000   10000   11000
00001   01001   10001   11001
00010   01010   10010   11010
00011   01011   10011   11011
00100   01100   10100   11100
00101   01101   10101   11101
00110   01110   10110   11110
00111   01111   10111   11111

Sending Messages Using Binary Digits
• Clearly, we could continue in the same fashion, using 6-tuples, 7-tuples, etc., as far as we might wish.
• If we restrict ourselves to just two symbols, e.g., 0 and 1, the procedure can be summarized as follows:
– Each position in the n-tuple, i.e., string of n symbols, can have either of the two symbols
– Hence, for pairs, we have 2×2 = 2² = 4 different strings
– For 3-tuples, we have 2×2×2 = 2³ = 8 different strings
– For 4-tuples, we have 2×2×2×2 = 2⁴ = 16 different strings
– For 5-tuples, we have 2×2×2×2×2 = 2⁵ = 32 different strings
– And so on. In general, the pattern tells us that for an n-tuple, there are 2ⁿ different strings; i.e., 2ⁿ different messages that can be sent.

Sending Messages Using Binary Digits
Here is another look at the table of 5-tuples. This time, we note that 2⁵ = 32 possibilities are enough to let us use them to denote a very convenient set of agreed-on messages: viz., letters and selected punctuation marks and special characters.

00000 = space   01000 = H   10000 = P   11000 = X
00001 = A       01001 = I   10001 = Q   11001 = Y
00010 = B       01010 = J   10010 = R   11010 = Z
00011 = C       01011 = K   10011 = S   11011 = .
00100 = D       01100 = L   10100 = T   11100 = ,
00101 = E       01101 = M   10101 = U   11101 = ?
00110 = F       01110 = N   10110 = V   11110 = $
00111 = G       01111 = O   10111 = W   11111 = %

Sending Messages Using Binary Digits
• The preceding slide shows one way of setting up a one-to-one correspondence between strings of 5 binary digits and letters of the English alphabet, plus space and some punctuation marks and characters.
– Note: The correspondences in the preceding slide are very close to those actually used by Jean-Maurice-Émile Baudot (1845-1903) in the 5-bit printing-telegraph code he patented in 1874, an ancestor of the codes later used by teletype machines.
• With these correspondences, we can use nothing but binary digits to send any message that can be spelled out in letters of the English alphabet.
– For many decades, telegrams used just this set of symbols, even when the message included numbers. For example: CAN YOU MEET ME AT RAILROAD STATION? MY TRAIN ARRIVES EIGHT THIRTY A.M.

Sending Messages Using Binary Digits
• With 6-tuples, there are 2⁶ = 64 different strings, enough to provide one-to-one correspondences between the strings and 26 upper-case letters, 26 lower-case letters, 10 numerals, space, and the period "."
– Computers, like other electronic apparatus, are well suited to using 2-valued signals internally, since much of a computer's circuitry operates by being in either an "on" or an "off" state at any given moment; actions are accomplished by changes from one state to the other (at enormously high speeds).
– Many computers in the 1940s and 1950s used essentially a set of 6-binary-digit strings (6-bit strings) to display and/or print out messages for humans.
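The counting argument and the 5-bit letter code above can be sketched in a few lines of Python. The table layout (space = 00000, A = 00001, …, % = 11111) follows the slide's table exactly; the helper names are my own:

```python
from itertools import product

# For n-tuples over the two symbols 0 and 1, there are 2**n distinct strings.
for n in (2, 3, 4, 5):
    strings = ["".join(bits) for bits in product("01", repeat=n)]
    assert len(strings) == 2 ** n

# The slide's 5-bit alphabet, in table order: space, A-Z, then . , ? $ %
alphabet = " ABCDEFGHIJKLMNOPQRSTUVWXYZ.,?$%"
encode = {ch: format(i, "05b") for i, ch in enumerate(alphabet)}
decode = {v: k for k, v in encode.items()}

message = "MEET ME AT EIGHT THIRTY"
bits = "".join(encode[ch] for ch in message)
print(len(bits))   # 115 -- 5 bits for each of the 23 characters

# Chopping the stream back into 5-bit groups recovers the text:
recovered = "".join(decode[bits[i:i + 5]] for i in range(0, len(bits), 5))
assert recovered == message
```

Note that Sender and Recipient must share the same table in advance; the bit stream alone carries no hint of which correspondence was used.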
The ASCII Character Set
• By the 1960s, demands for something more convenient for human needs had increased to the point of stimulating a general shift from 6-bit strings to 8-bit strings, since these provide for 2⁸ = 256 possible different combinations. These 8-bit strings are called bytes.
– A more-or-less standard set of 256 correspondences was developed by the U.S. government and computing industry. It was named the American Standard Code for Information Interchange, or ASCII, set.
– Note: Sets of 4 bits are sometimes called half-bytes, or quadbits, or nibbles.

The ASCII Character Set
• The ASCII character set was formally adopted as a U.S. standard in 1968 by the then National Bureau of Standards (now the National Institute of Standards and Technology, NIST) and by the American National Standards Institute (ANSI).
• The International Organization for Standardization (ISO) later adopted the ASCII character set as an international standard.
– This organization's name is abbreviated "ISO", not "IOS" (as you might expect if it were an acronym, which it is not).
– The reason is that "iso" means "equal" or "standard" in Classical Greek. The organization chose this world-wide abbreviation of its name in order to avoid the problems of using different acronyms in different languages.
The ASCII Character Set
• The ASCII set is "more-or-less standard" in the following sense:
– Most modern computers use the first 128 8-bit patterns (those beginning with a 0, often called the "low-order" characters) in the same way, to represent upper- and lower-case letters of the English alphabet plus numerals, a variety of punctuation symbols, and some invisible control codes (e.g., the carriage-return and end-of-file marks).
– The second 128 8-bit patterns (those beginning with a 1, called the "high-order" characters) are used in slightly different ways to represent accented characters in various Western European languages; e.g., there are minor differences among the U.S. English, French, German, and Spanish sets of high-order characters.

Sample ASCII Characters

Symbol   Binary Representation   Decimal Equivalent   Hexadecimal Equivalent
space    00100000                32                   20
#        00100011                35                   23
(        00101000                40                   28
.        00101110                46                   2E
1        00110001                49                   31
2        00110010                50                   32
A        01000001                65                   41
B        01000010                66                   42
a        01100001                97                   61
b        01100010                98                   62
É        11001001                201                  C9
é        11101001                233                  E9

The Unicode Character Set
• The 256-symbol ASCII character set, and its minor variants, handle languages with Latin alphabets quite well. But in this world there are also
– Many languages that use non-Latin alphabets (e.g., Arabic, Cherokee, Cyrillic, Hebrew, Hindi, Korean)
– Some languages that use ideographs (e.g., Chinese, Japanese) rather than alphabets
• The existence of non-Latin-alphabet and ideographic languages means that the total set of symbols that people around the world want to use in computers and telecommunications is vastly larger than 256.
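The sample character values above are easy to spot-check with Python's built-ins (ord gives a character's code number; format renders it in binary or hexadecimal):

```python
# Each (symbol, decimal, hex) triple matches a row of the sample ASCII table.
samples = [(" ", 32, "20"), ("#", 35, "23"), ("(", 40, "28"),
           (".", 46, "2E"), ("A", 65, "41"), ("a", 97, "61"),
           ("é", 233, "E9")]
for ch, dec, hx in samples:
    assert ord(ch) == dec and format(dec, "X") == hx
    print(ch, format(dec, "08b"), dec, hx)

# A CJK ideograph, by contrast, has a code number far beyond one byte:
print(ord("零"))   # 38646 -- it cannot be an 8-bit character
```

The last line illustrates exactly why a symbol set "vastly larger than 256" forces a move beyond single-byte codes.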
• To answer the need for a standard set of symbols capable of serving all written languages, a group of computer-industry firms and standards bodies, organized as the Unicode Consortium and working in coordination with the ISO, has developed the Unicode Character Set.

The Unicode Character Set
• The Unicode characters consist of 16-bit strings, and are often referred to as double-byte characters.
• With 16 bits, the total number of possible distinct messages (i.e., symbols) is 2¹⁶ = 65,536, which is enough to accommodate all written languages.
• The ASCII characters are represented in Unicode by 16-bit strings in which the first 8 bits are all 0; non-ASCII strings in Unicode have at least one 1 among their first 8 bits. Thus, the ASCII character set is a subset of Unicode.
• As an example, the Chinese character for the word "zero" (as distinguished from the digit "0") is represented in Unicode by 1001011011110110, which equals 96F6 in hexadecimal notation and 38,646 in decimal notation.

Prefixes Denoting Quantities
• Our next topic deals with the storage of large quantities of information.
• The discussion will be aided by the use of standard prefixes for denoting large and small quantities.
– You are undoubtedly already familiar with some of these prefixes:
• "kilo" denotes 1,000 of the basic units (e.g., kilometer, kilohertz)
• "micro" denotes 1/1,000,000, or one millionth, of the basic unit (e.g., microsecond)

Prefixes Denoting Quantities
• The standard prefixes that have been defined to date by the International Organization for Standardization (ISO) range from 10²⁴, one septillion of the basic units, to 10⁻²⁴, or one septillionth of the basic unit.
– This may seem to you to be an enormous range, and it is.
– But the ISO set of prefixes has already been extended twice in recent decades in order to handle the need for dealing with both ever larger and also ever smaller quantities.
– This need arises from the seemingly inexorable advances of science and technology. You can expect new prefixes to be defined in the future.

Prefixes Denoting Quantities

Prefix   Symbol   Multiplier                                   Exponential Form
yotta    Y        1,000,000,000,000,000,000,000,000            10²⁴
zetta    Z        1,000,000,000,000,000,000,000                10²¹
exa      E        1,000,000,000,000,000,000                    10¹⁸
peta     P        1,000,000,000,000,000                        10¹⁵
tera     T        1,000,000,000,000                            10¹²
giga     G        1,000,000,000                                10⁹
mega     M        1,000,000                                    10⁶
kilo     k        1,000                                        10³
hecto    h        100                                          10²

Prefixes Denoting Quantities

Prefix   Symbol   Multiplier                                   Exponential Form
centi    c        0.01                                         10⁻²
milli    m        0.001                                        10⁻³
micro    μ        0.000 001                                    10⁻⁶
nano     n        0.000 000 001                                10⁻⁹
pico     p        0.000 000 000 001                            10⁻¹²
femto    f        0.000 000 000 000 001                        10⁻¹⁵
atto     a        0.000 000 000 000 000 001                    10⁻¹⁸
zepto    z        0.000 000 000 000 000 000 001                10⁻²¹
yocto    y        0.000 000 000 000 000 000 000 001            10⁻²⁴
Storing Quantities of Characters
• In English (and other alphabetic languages) letters are combined to make up words; words make up sentences; sentences, paragraphs; paragraphs, chapters or letters or articles; and so on.
• What does all this imply for storage?
– One letter is equivalent to 1 byte, consisting of 8 bits.
– In English, the average length of a word is 5.8 letters, plus the necessary space (which really functions as the 27th letter of the English alphabet) that separates a word from its successor. So 1 word = 6.8 characters = 6.8×8 = 54.4 bits = 6.8 bytes ≈ 7 bytes. (The symbol "≈" means "approximately equal.")
– A typical journal article consists of about 3,500 words, which is equivalent to 3,500×7 = 24,500 bytes, or roughly 25,000 bytes.

Storing Quantities of Characters
• A typical book contains around 100,000 words, or roughly 700,000 bytes.
• The Encyclopedia Britannica contains, in its 31 physical volumes, about 43 million words (about 1.4 million words per volume); in all, roughly 300 million bytes, i.e., 300 megabytes (300MB).
– Note: The CD version of this encyclopedia contains approximately 650MB, so about 350MB of storage must be devoted to non-text materials, i.e., illustrations. Storing illustrations involves techniques for digitizing images.
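These estimates are simple multiplications; a few lines of Python (constants taken from the slides) reproduce them:

```python
BYTES_PER_CHAR = 1          # one 8-bit byte per letter or space
CHARS_PER_WORD = 5.8 + 1    # average word plus its trailing space

word_bytes = CHARS_PER_WORD * BYTES_PER_CHAR   # 6.8, rounded up to ~7
article_bytes = 3_500 * 7                      # typical journal article
book_bytes = 100_000 * 7                       # typical book
britannica_bytes = 43_000_000 * 7              # Encyclopedia Britannica text

print(article_bytes)      # 24500 -- roughly 25,000 bytes
print(book_bytes)         # 700000 -- roughly 0.7 MB
print(britannica_bytes)   # 301000000 -- roughly 300 MB
```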
Storing Quantities of Characters
• Storing images electronically, i.e., in digitized form, is more complicated than storing text, since the resultant size in bytes of a file containing a digitized image is affected by
– The resolution (dots per inch, DPI) of the scanning
– The degree of color accuracy desired (unless the image uses only black and white)
• Color accuracy can range from 16 colors, requiring 4 color bits per pixel, to 16,777,216 colors, requiring 24 color bits per pixel
– The degree of compression used

Storing Quantities of Characters
• Some examples:
– The Chinese ideograph we saw earlier, a simple black-and-white line drawing, is stored in a GIF file totaling 289 bytes.
• GIF files compress their pixel data losslessly, but limit an image to a palette of at most 256 colors, so converting a full-color image to GIF can sacrifice some quality in order to reduce the file size.
• Note: JPEG files use lossy compression, i.e., they set a compromise between image quality and file size by discarding some image detail.
– I recently scanned an 8"x10" black-and-white photograph into a medium-resolution TIFF file. The file contained about 940,000 bytes (i.e., 0.94MB). TIFF files use lossless compression.
– I also scanned an 8"x10" color photograph of the same scene into a TIFF file. This file contained about 2.8MB.

How Many Bytes Are in the UT-Austin General Libraries?
• Another example: The UT-Austin General Libraries contain about 8 million volumes.
– If these volumes consisted solely of text, they would total roughly 8,000,000×700,000 = 5.6 trillion bytes, i.e., 5.6 terabytes (5.6TB).
– If, on the other hand, they contain about the same mixture of text and illustrations as the Encyclopedia Britannica, then they must total roughly 12.1TB of information.

How Many Bytes Are in the UT-Austin General Libraries?
• Continuing these estimates, we can note further that
– The Perry-Castañeda Library (PCL), one unit of the General Libraries, contains about 1.5 million volumes, or (at the text-only rate of roughly 700,000 bytes per volume) about 1.06TB of information.
– PCL offers 349,313 lineal feet (about 66.2 miles or 106.5 kilometers) of shelf space* for these volumes. The PCL shelves currently provide some room for expansion of the collection.
– Estimating the currently occupied PCL shelf space as 50% of the total space, or about 174,656ft (53.2km), we can calculate that, in the form of typical printed books, 1 yard of filled shelf space contains an average of about 18.2MB of information (or that 1 meter of filled shelf space contains an average of about 19.9MB of information).
*For this datum I am indebted to George Cogswell of the General Libraries staff.

How Many Bytes Are in the UT-Austin General Libraries?
• In comparison with these dimensions for book storage:
– Computer storage systems in the multi-terabyte range are currently being manufactured and used, with physical sizes comparable to those of a compact car or smaller.
– An interesting statistic from the World-Wide Web provides an additional perspective:
• "Web ad placement firm DoubleClick currently maintains over 100 terabytes of storage. If printed out, that would equal about 300 single-spaced sheets of paper for every Web user." [Source: Tracking the web of data you weave. 2000 October 2: U.S. News & World Report; p. 66]

How Many Bytes Are in the UT-Austin General Libraries?
• Another comparison:
– Using CDs to store 12.1TB (at approximately 650MB per CD) would require about 18,700 CDs.
– Since DVDs have about 10 times the storage capacity of CDs, storing the contents of the General Libraries on DVDs would require about 1,870 DVDs.

How Many Bytes Are in the UT-Austin General Libraries?
• A standard DVD case measures 19cm high, 13.7cm wide, and 1.5cm thick.
– To store 1,870 DVDs in their cases would require 1870×1.5cm = 28.05m (about 92 lineal feet) of shelf space of suitable height. Total volume (using rounded case dimensions of 20×14cm): 1870×20×14×1.5 = 785,400cm³ = 0.79m³ ≈ 27.7ft³.
• A standard CD case measures 12.6cm high, 14.3cm wide, and 1cm thick.
– To store 18,700 CDs in their cases would require 18700×1cm = 187m (about 614 lineal feet) of shelf space of suitable height. Total volume: 18700×12.6×14.3×1 = 3,369,366cm³ ≈ 3.37m³ ≈ 119ft³.
• Note: These calculations are very rough estimates, but they furnish a general idea of the reduction in physical size that can result from storing information in binary form in computer-related media rather than in the form of texts and images on paper.

How Many Bytes of Information Are Produced Annually in the World?
• In October 2000 Profs. Peter Lyman and Hal Varian of the School of Information Management and Systems, University of California, Berkeley, released a study of the world's annual production of information.
– The study, sponsored by EMC Corp., is entitled How Much Information?
– The authors estimate that "the world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes [i.e., 1.5 exabytes] of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth."
– A summary of the study is available from a press release, UC Berkeley Professors Measure Exploding World Production of New Information.

Saving Space Can Be Important
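As a wrap-up, the lesson's chain of storage estimates can be reproduced in one short script. All constants are the slides' own; the variable names and the 650/300 scaling step (Britannica's CD size over its text-only size) are my restatement of the reasoning:

```python
MB, TB = 10**6, 10**12

book_bytes = 700_000                     # one all-text book
text_only = 8_000_000 * book_bytes       # General Libraries, text only
print(text_only / TB)                    # 5.6 TB

# Scale text up by the Britannica CD-to-text ratio (650MB vs 300MB):
mixed = text_only * 650 / 300
print(round(mixed / TB, 1))              # 12.1 TB with illustrations

cds = mixed / (650 * MB)                 # ~18,700 CDs
dvds = cds / 10                          # ~1,870 DVDs at ~10x CD capacity
shelf_m = dvds * 1.5 / 100               # cased DVDs, 1.5 cm thick each: ~28 m
print(round(cds), round(dvds), round(shelf_m, 1))
```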