Formal definitions of data, information and knowledge: an introduction
Martin Hilbert; University of California, Davis
[email protected]
Nov. 2015
Contents
How to formalize data, information and knowledge?
Data symbols: symbolizing differences
Information: probabilistic uncertainty
The importance of compression: driver of the information explosion
How to convert data into information: coding
From information to communication: channels
Knowledge: algorithmic descriptions
Kolmogorov complexity: description length
Knowledge and algorithms
An ‘amazing’ and ‘beautiful’ fact
Proof outline of the information-knowledge equivalence
Some after-thoughts: the mixed use of information and knowledge
References
The following introduces formal notions of information and knowledge from information theory and
computer science. Information is at the heart of our reality. In the physical realm, the ultimate speed of
information sets the limits on the macro-level (the speed of light), and informational loss due to
uncertainty in quantum effects sets the limit on the other extreme of the micro-level. In biology, the basis
of life on the micro-level consists of information encoded into genes, and macro-level evolutionary forces
produce complex information structures through uncertainty-reducing natural selection. In the social
sciences, people spend the vast majority of their waking time interacting with mediated communication
devices, and on the macro-level big data applications and machine learning algorithms drive the economy,
democracy, and social organization. The Nobel-laureate and co-founder of the Santa Fe Institute, Murray
Gell-Mann, came to the conclusion that although the complex systems that make up our reality “differ
widely in their … attributes, they resemble one another in the way they handle information. That common
feature is perhaps the best starting point for exploring how they operate” (1995, p. 21).
Information theory is a branch of mathematics that belongs to probability theory and is nowadays
mainly taught in Electrical Engineering and Communication Departments. It is the rare breed of a branch
of science that can almost exclusively be traced back to one single and groundbreaking paper: Claude
Shannon’s (1948) “A mathematical theory of communication”. Shannon’s work is arguably the most
influential research of the past century, as it laid the foundations of the digital age by conceptualizing the
informational bit (for a historical and entertaining overview with lots of anecdotes, see (Gleick, 2011)).
Following intuition, Shannon basically defined information as the opposite of uncertainty: with much
information, little uncertainty; and with much uncertainty, little information. Communication of
information is then the process of uncertainty reduction. Shannon showed that the adequate metric to
measure uncertainty is entropy. Conceptualizing uncertainty in terms of probability theory results in his
probabilistic theory of information. Roughly speaking, information is carried by the answer to questions,
while the specification is contained in the question.
However, probability is not the only approach. A maybe more intuitive notion of information refers
to the description of something. Since this approach conceptualizes information as the description of
something it is not probabilistic, but deterministic because a description provides a step-by-step
procedure (an algorithm) of the essence of the described. To describe something requires the knowledge
of it, so this concept is more akin to knowledge than to information. Roughly speaking, knowledge is
contained in the description of something. This deterministic concept has its theoretical home in computer
science. The theoretical foundation of this algorithmic understanding goes back to the seminal Russian
mathematician Andrey Kolmogorov and his “Three Approaches to the Quantitative Definition of
Information” (1965; 1968). His algorithmic approach is so natural that it was independently discovered by
Solomonoff (1960, 1964a, 1964b) and Chaitin (1966). Today it is generally known as “Kolmogorov
complexity” (Li and Vitanyi, 2008).
It turns out that both are two sides of the same coin. Just like two sides of a coin, they are not identical,
but they are two complementary ways of looking at the same quantity. Even this relationship is quite
intuitive: either you obtain some kind of information by a series of probabilistic questions that reduce
your uncertainty, or you receive the deterministic description of what it is about. Both are equivalent.
While these are all quite intuitive notions, it can be extremely confusing if we use them merely as
intuitive metaphors. Therefore, luminaries like Shannon and Kolmogorov worked out unequivocal and
formally precise definitions of these concepts. These turn out to be so crystal-clear that engineers were
even able to embed them into the most diverse forms of physical structure, which led to the myriad of
information technologies that surround us today. We start by identifying our most fundamental building
blocks: data, information and knowledge.
How to formalize data, information and knowledge?
Social scientists have argued that there is currently a desperate need for “sharpening the distinctions
between information and knowledge… [due to] the continued acceleration of innovations in information
and communication technologies” (Cohendet and Steinmueller, 2000; p. 195). This means, given that we
have technologies that handle information and knowledge, we need to understand what they are. We will
turn the tables of this argument around and use the proper definitions of information and knowledge
employed by digital technologies to distinguish between them conceptually. This approach is typical in
the history of science.
“Many of the most general and powerful discoveries of science have arisen, not through the study of
phenomena as they occur in nature, but rather, through the study of phenomena of man-made devices,
in products of technology, if you will. This is because the phenomena in man’s machines are simplified
and ordered in comparison with those occurring naturally, and it is these simplified phenomena that man
understands most easily” (Pierce, 1980; p. 19). Our understanding of hydrodynamics does not come from
studying fish in nature, but from building ships; our understanding of thermodynamics does not stem from
studying fire, but Carnot wondered about why gun barrels got hot when moving a cannon ball through
them. Steam locomotives were up and running two decades before he published his thermodynamic
‘Reflections’. Likewise, the details of aerodynamics do not come from studying birds, but from building
airplanes. The Wright brothers almost killed themselves while executing the first heavier-than-air flights
before we had a thorough understanding of the details of lift, drag, thrust and weight. Similarly, our
understanding of electromagnetism does not originate from the study of lightning, but from electrical
engineering. Michael Faraday built his first electric motor ten years before writing down his first equation,
and Maxwell’s more general equations of electromagnetism clarified things only 40 years later. With this
in mind, it should not be surprising that our understanding of the essence of data, information,
communication and knowledge does not stem from how they occur in physical or biological nature
(including in brains and societies), but through a co-evolutionary dynamic between science and
technology –in this case, technologies of information,
communication and knowledge.
In agreement with the well-known data-information-knowledge pyramid framework from the business
literature on knowledge management (Zeleny, 1986; Ackoff, 1989), also in information theory “information is
defined in terms of data, knowledge in terms of information” (Rowley, 2007; p. 163). Engineers and
computer scientists likewise see data, information and knowledge as different, although closely related,
concepts. The Figure uses the traditional pyramid framework, but adds a preview of the more formal
concepts of data, information and knowledge presented in the following.
Key takeaway: Our understanding of different aspects of reality often derives not from
observations in nature, but from studying technologies that make use of this aspect.
Data symbols: symbolizing differences
At the beginning there was a distinction. If absolutely nothing can be distinguished, there is nothing.
There is not even information. There is no structure in space or time, nothing to perceive. We could not
even tell if there is space or time. To identify some extension in space requires a distinction between at
least two points. How would you know that there is time? Something needed to happen (in your
perception or in what you observe). Without anything else, we do not have any criteria, but plain nothing.
But if there are differences, there is something, there is potential information.
The most fundamental distinction is that something is either there or not. This is a binary distinction.
This is the basis for the famous physics quip “it from bit” (Wheeler, 1990; p.5). First there needs to be
something that distinguishes itself from nothing. This very same idea is also the basis for digital
technology. In digital networks binary information is encoded through the existence of current or no
current (in an electronic circuit), light or no light (in a fiber-optic cable), certain wave or not (in cellular
frequency spectrum), etc. In essence, one only needs to define one thing to have a binary distinction: it
and everything it is not. For us, differences come in many shapes and forms. They can be visual, auditory,
tactile, olfactory, gustatory, imagery, or dynamic in time or structural in space, etc. Some we cannot
perceive, or only with augmented perception through instruments. For example, infrared light was not
information before the 1800s, and the notorious neutrino ‘ghost particles’ that pass through us by the
trillions every second were differences that were not detected until the mid-1950s.
Any distinction, any perceivable difference is data, a symbol. In information theory, there is no
distinction between the two terms data and symbols. Some people make the distinction that data or symbols
needs to stand for or suggest something else, which suggests that there is an original source that displays
perceivable difference and then some other symbolic differences (for example in an alphabet) are used
to represent these differences. So perceived sounds can be represented with visual differences drawn on
a paper in the form of musical notes or letters, olfactory smells with visual facial expressions, or patterns in
a dynamical system with a series of 0s and 1s. But if we go deeper down the rabbit hole, the initial
difference of the source (the sound, the smell, or the pattern) was already symbolizing some difference.
It stands for the difference that makes the sound (like a vibrating cord), the smell (like decomposing
matter), or the pattern (like a dynamical system). Those again stand for lower level causes. On a higher
level, other symbols, like ‘Mozart’s requiem’ stand for a collection of audible symbols (musical notes),
which stand for sounds, which again… etc. All the way down the rabbit hole we come to the
conclusion that we can create an infinite number of hierarchical multilayer systems of symbols
representing other symbols that represent other differences, etc. So in practice this question is rather one
of coarse-graining (what represents what), of measurement sophistication and of reality modelling (what
to consider, what to emphasize, and what to leave out), not of the essence.
Key takeaway: Any distinction, any perceivable difference is a data symbol.
It was Bell Labs engineer Ralph Hartley (1928) who was among the first to systematically ask how
much uncertainty is reduced when we work with data symbols that represent distinctions. This is the first
step to go from plain distinctions to information. It links data symbols to the reduction of uncertainty,
which we usually refer to as information. Hartley realized that the amount of uncertainty relates to the
number of alternative choices we have: more choices, more uncertainty (Massey, 1998). So in a binary
choice, like a coin flip, we have two choices. Either tails shows up or not (in essence, as already said, we
don’t need to define the other choice, which in this case is heads, as the non-existence of one already
defines the other in a binary choice). If we have a die, we have six different possibilities. This provides
more uncertainty with regard to the true value that is revealed after allowing lady luck to play her game.
Much more uncertain is the draw of 6 lottery numbers from 49 possible numbers (with 13,983,816
possibilities). This is equal to taking the 4 hour drive from Boston to New York City while being blindfolded
and at one point throwing a coin out of the window that hits a one inch pole randomly placed at the side
of the road. This contains lots of uncertainty. This suggests that it might be useful to equate uncertainty
with the amount of surprise. If you would hit the pole, you would be very surprised, right? If a coin-flip
lands tails, less so. Therefore, the lottery contains more surprise and uncertainty than a coin flip. This then must
also imply that communicating the result of a winning lottery number must contain more information
(more uncertainty reduction), than communicating the result of a coin flip (whose result is less uncertain).
At this point it is important to realize that we can describe any arbitrary number of choices as a
consecutive combination of the most fundamental binary choice. For example, the six-sided die contains
several possible binary distinctions. For example, one binary choice is whether the outcome is larger than 4 or not.
Another one is whether it is smaller than 2 or not, etc. It is possible to create all other kinds of data symbols from
a binary choice, which is a fact that is exploited by one of the main practical advantages of the digital over
the previous analog paradigm (see box).
Like all things in this universe, information too is subject to the notorious 2nd Law of Thermodynamics
and—without intervention—decays over time. This is because noise (which comes from interactions with
the environment) will over time make previous differences indistinguishable. This will lead to a loss of
information. No difference, no information. One of the main benefits of digital information in contrast to
analog information is that it can be maintained without accumulating degradation (Kuc, 1999). The
following Figure schematizes an analog signal (such as on a tape, vinyl record, or paper). Random
fluctuations from the environment (such as through environmental interaction, or during copying) add
noise to the signal. In vinyl records this would be a hiss, while in paper it means that copies or old drawings
become less precise. The noise component cannot be removed and becomes part of the subsequent
version of the analog signal. Errors creep in and become part of the information, inevitably.
Digital signals also decay with noise through copying or over time (as you might have experienced if
you ever tried to recover an old and forgotten disc). However, as digital signals merely make the very basic
distinction between being there or not, these two extremes make it easy to recover the original signal.
This can be done through a processor that detects the binary signal by evaluating a threshold, and decides
if the data is there (above threshold) or not (below). Since binary is the largest difference there is, it is as
robust as it can get.
This being said, extremely large noise spikes can still fool the processor by ‘flipping digits’, 1 to 0, or 0 to
1. The achievable limit is set by Shannon’s “channel coding theorem” (Shannon, 1948; Cover & Thomas, 2006). In this
case, an error correction code is used that adds a few additional digits in a smart way to detect those
mistakes (Pierce, 1980), making digital communication error-free in principle.
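To make the threshold idea concrete, here is a minimal sketch in Python; the 0.5 threshold, the noise level and the variable names are illustrative assumptions, not part of the original text:

```python
import random

# Minimal sketch: transmit bits as levels 0.0 / 1.0, let the channel add
# Gaussian noise, and let the receiver decide with a simple threshold at 0.5.
random.seed(1)

bits = [random.randint(0, 1) for _ in range(10_000)]
received = [b + random.gauss(0, 0.3) for b in bits]    # noise from the environment
decoded = [1 if level > 0.5 else 0 for level in received]

flips = sum(b != d for b, d in zip(bits, decoded))
print(f"{flips} flipped digits out of {len(bits)}")    # rare large noise spikes still flip some digits
```

The few remaining flips are exactly the kind of mistakes that the error-correcting codes mentioned above are designed to catch.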
To keep track of that let us symbolize presence with 1 and absence with 0. This is simply social practice
and does not have anything more fundamental to it. You might as well replace it with any exclusive and
exhaustive differentiation (e.g. up & down, black & white, 42 and not, etc.). So for two data symbols we
get the set of the following possibilities: [0; 1]. This means that it is either not there (0) or there (1). With
two binary symbols, we can describe four choices: either none of the two is there, or the one, or the other,
or both: [00, 01, 10, 11]. With three binary choices, we already can distinguish among eight differences:
either none is there, or the first, or the second, or the third, or the first and second, or the first and third,
or the second and third, or all three: [000, 100, 010, 001, 110, 101, 011, 111]. Note that each additional
binary symbol provides “twice as much” or “times two” the number of symbol sequences: if we add a third
binary symbol, we can repeat each of our 2-symbol sequences twice, once with a new 0 in front [000, 001,
010, 011] and once with a newly added 1 in front [100, 101, 110, 111]. In the same sense, if we have 16
choices, we need 4 symbols (2 ∗ 2 ∗ 2 ∗ 2 = 2⁴ = 16); and if we have 32 choices (such as among the 32
characters of an extended alphabet), we need 5 symbols (2⁵ = 32).
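A short Python sketch of the doubling logic just described (purely illustrative): each additional binary symbol doubles the number of distinguishable sequences.

```python
from itertools import product

# n binary symbols can represent 2**n different choices.
for n in range(1, 6):
    sequences = ["".join(s) for s in product("01", repeat=n)]
    print(n, "symbols ->", len(sequences), "choices")   # 2, 4, 8, 16, 32
```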
This can be represented with a tree-like structure, such as shown in the Figure. The logic in the Figure
also exemplifies that five binary symbols reveal the answers to five binary questions, in this case: is the
letter in the first half of the alphabet? If we do not know anything else about the distribution of the letters,
each answer to this binary question reduces the probability space by half. Therefore it makes sense to say
that the uncertainty gets reduced by half. Hartley concluded that there are 5 binary choices in these 32
equally likely choices: we need 5 symbols to resolve all uncertainty inherent in 32 equally likely choices.
The number of differences (32) and the required number of symbols (5) relate to each other through
exponentiation, in which the number of possible states is growing with an additional power with each
additional symbol (e.g. 2⁵ = 32). This is because each new symbol multiplies the number of choices by its
value (see Figure 1). Just like subtraction counteracts addition, and division reverses multiplication, the
counterpart of exponentiation is the logarithm. Therefore, the other way around, if we are presented with
the number of possibilities and want to find out how many symbols we need to represent it, we have to
use the inverse of the power function, which is the logarithm. For example: log₂(32) = log₂(2⁵) = 5.
Key takeaway: Binary distinctions are the most fundamental ones and can be used to
recreate any other kind of multiple differences by concatenating binary choices.
Hartley concluded that the logarithm is the right tool to translate from the number of possibilities
(uncertainty) to the amount of information. Having discovered a first indication that the amount of
information can be quantified begs the question of a measurement metric. Working with the
logarithm Hartley realized that the quantity of information can be measured in different measurement
units, just like weight can be measured in kilograms and pounds, length in meters and feet, and
temperature in Celsius and Fahrenheit. They are just different metrics, they do not change the quantity
itself. Hartley understood that “the base selected (for the logarithm) fixes the size of the unit of
information” (Hartley, 1928; p. 540). For example, a die that reveals six equally likely options can be used
as a measurement tool. Two dice provide 36 different options. Therefore, when asking how many dice we
need to represent 36 options we simply work with the logarithm of base 6: log₆(36) = log₆(6²) = 2.
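As a quick check of this unit-of-measurement idea, the following sketch uses Python's math.log with the numbers from the text:

```python
import math

# The base of the logarithm fixes the unit of information:
print(math.log2(32))      # 5.0 -> five binary symbols resolve 32 equally likely choices
print(math.log(36, 6))    # ~2  -> two dice resolve 36 equally likely choices

# Changing the unit is just a change of base: one die is worth log2(6) ~ 2.585 binary choices,
# so 36 choices measure ~5.17 bits either way.
print(math.log(36, 6) * math.log2(6), math.log2(36))
```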
Summing up, Hartley’s basic idea was that the amount of information is equivalent to the number of
symbols that is needed to describe an array of uncertain possibilities. The logarithm translates between
both. Selfless as scientists are, Hartley used the letter H as a nomenclature for his unit of information:
Hartley’s measure of information = log(number of choices)
Key takeaway: The logarithm translates between the number of choices and the number of
symbols required to distinguish between them.
Information: probabilistic uncertainty
Information is a “difference which makes a difference” (Bateson, 2000; p. 272). The difference that it
makes is that it reduces uncertainty. If some kind of symbol shows a difference, but this does not reduce
uncertainty it should not be accounted for as information. It is simply some kind of redundant data; just
a symbol, but without information content.
Following this logic, we can say the following: ‘same message twice, does not add information’;
‘same message twice, does not add information’, contains roughly twice as much data as information.
Instead of repeating the message twice in full, we could also state: ‘two times [same message twice, does
not add information]’. The latter has almost half as many symbols as the former, but contains the same
amount of information (reduces the same amount of uncertainty). The sentence ‘2 x [same msg twice, ds
not add info]’ even contains a third fewer letters and would also be understood in today’s text-message
circles. It still communicates the same information content. Much in line what modern text messengers
often do, older languages, like Arabic or Hebrew, gave up on the use of vowels, or relegated them to
diacritical glyphs, dots or dashes, because they considered them redundant. Could you obtain the same
information from: ‘tw tms [sm mssg twc, ds nt dd nfrmtn]’? All of this begs the question of whether there is a definite
number of symbols in a certain message, an indisputable minimal information content.
The process that takes out all the redundant data symbols and leaves us with pure uncertainty-reducing information is called ‘compression’. The fact that data is ‘compressed’ to obtain information
illustrates again that information is only a part of data: not all data contains information. The remaining
quantity is called the “entropy of an information source” (Shannon, 1948). This is established by Shannon’s
famous “source coding theorem”, which holds that it is impossible to compress data beyond this limit
(Shannon, 1948; Cover and Thomas, 2006). As such, the entropy of a message is the ultimate information
content. Reducing the size of the message beyond that (i.e. by taking further symbols away) will definitely
reduce the information contained in the message. For example ‘2 x [sm mg tc, n add in]’ might be too little
to unambiguously decode the original message.
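The effect of redundancy can be illustrated with any off-the-shelf compressor; the sketch below uses Python's zlib module on the repeated sentence from above (the exact byte counts depend on the compressor and are not claimed by the text):

```python
import zlib

msg = b"same message twice, does not add information. "

# Repetition adds data symbols but almost no information,
# so the compressed size barely grows with additional copies.
for copies in (1, 2, 100):
    data = msg * copies
    packed = zlib.compress(data, level=9)
    print(f"{copies:>3} copies: {len(data):>5} data bytes -> {len(packed):>4} compressed bytes")
```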
Twenty years after Hartley (1928), a younger colleague of his at Bell Labs, Claude Shannon, picked up
the ball and looked deeper into this issue. Shannon (1948) adopted Hartley’s basic idea that information
refers to the number of symbols needed to describe different available choices, but he additionally
discovered that Hartley’s measure only works for cases where all choices are equally likely (uniformly
distributed). However, for the more general case where options are not equally likely, we require a
fine-tuning of Hartley’s approach. It turns out that the key to solving this reveals the difference between data
symbols and true information.
For example, in the Figure with Shannon’s picture above, not all letters are equally likely in reality.
The letter e appears in the English language about 13 % of the time and the letter t about 10 %, while the letters q and z
only appear about 0.1 % of the time on average. Saying that the letter e is more likely is the same as saying that
it is more certain, or less uncertain that it will show up. As a consequence, if the recipient of the
information already knows that some are more likely than others, the recipient already has less
uncertainty about the issue in question. This is a subtle but important difference. It basically tells us that
to do our accounting of information right, we have to consider that there is already some information
contained in the probability distribution of the options.
If we know nothing about the involved probabilities, the best we can do is to assume that they are
equally likely. This is intuitive and has deep roots, going back to one of the founding fathers of probability,
Pierre Simon Laplace and his so-called “principle of indifference” or “principle of non-sufficient reason”.
In the words of one of the most important economists, John Maynard Keynes: “The Principle of
Indifference asserts that if there is no known reason for predicating of our subject one rather than another
of several alternatives, then relatively to such knowledge the assertions of each of these alternatives have
an equal probability. Thus equal probabilities must be assigned to each of several arguments, if there is
an absence of positive ground for assigning unequal ones” (1921; p. 42; italic in original). Later on, Jaynes
set this intuitive idea on a very firm mathematical ground, which is known as the principle of maximum
entropy (MaxEnt) (Jaynes, 1957a; 1957b; 2013). It turns out that in this case, Hartley’s measure is the
correct measure for information. In this special case there is a straightforward relation between the
numbers of options and their uncertainty. If there are 4 equally likely options, on average each of them
has a probability of 1/4, etc.
Key takeaway: In agreement with Laplace’s Principle of Indifference and Jaynes’ MaxEnt
principle, if nothing is known about the underlying distribution it is most reasonable to
assume that all options are equally likely.
If we know something more about the chances of each option, then we have to consider this
knowledge in our information accounting. This is because the very information about the non-uniform
probabilities of the options, already reduces our uncertainty on average. Imagine a biased coin with a
99.999 % chance of heads. Do you face much uncertainty when flipping it once? It is pretty certain that
the biased coin will land on heads. Working with many repetitions, on average, 99,999 out of 100,000 coin
flips will contain no surprise, and therefore no information in addition to what you already know. This
again suggests that a good way of approaching our accounting is to equate the amount of uncertainty
with the amount of surprise. We can ask how surprised we would be. With a fair coin, every other coin
flip is a surprise on average. This is much more uncertain than our biased coin. Generalizing this insight
we can see that if we know the underlying probability distribution and if it is not perfectly uniform, the
revelation of the result provides less surprise on average. This gives us two important insights. First, every
assessment of information is conditional: it is conditioned on knowing the underlying distribution of the
random variable (Caves, 1990). Second, Shannon (1948) concluded that we receive less information in the
case where the probability distribution is not perfectly uniform. But how much less?
For this let us imagine that our coin-flips themselves are generated by drawing balls from an urn
(Massey, 1998). The left urn in the Figure offers an equal chance of obtaining a (1) and a (0), with the
probability of drawing a (1) equal to 50 %, p(1) = 0.5, and the probability of drawing a (0) equal to 50 %,
p(0) = 0.5. The right-hand urn is biased toward drawing a (0), with p(1) = 0.25 and p(0) = 0.75.
Notwithstanding, drawing from the left or the right-hand urn will result in either (0) or (1). Since there are
two choices, Hartley would say that both provide the same amount of information: the result resolves the
uncertainty regarding which of the two possibilities is chosen. Hartley’s measure would give us:
log₂(2 choices) = 1 unit of information.
However, what if we know the content of the urns in advance? How do we account for already having
this aspect of the information? If we have this part of the information already, then we know that the
right-hand urn is more likely to provide a (0) instead of a (1). Looking at the right-hand urn, we see that,
on the one hand, there is only one chance in four of choosing the ball marked (1). Thus choosing this ball
is in a sense equivalent to choosing 1 out of 4 possibilities, or, equivalently, the probability of choosing
the ball marked (1) is 1/4. Following Hartley’s logic, revealing one out of four choices requires two binary
symbols: 2² = 4; log₂(4) = 2. On the other hand, there are three chances out of four of choosing a
ball marked (0). Thus, choosing such a ball has a probability of 3/4, or choosing 3 out of 4 possibilities (or
equivalently you can say 1 out of 4/3 choices, which makes much mathematical sense, but is more difficult
to represent with balls…). Plugging this result brute force into Hartley’s logic we get 2ˣ = 4/3; x =
log₂(4/3) = 0.415… units of information. We end up with two different numbers—2 for (1) and 0.415 for
(0)—and would still like to know how much information is revealed when drawing from the right-hand
urn. Shannon reasoned that the adequate thing to do is to weight both by their probability of appearance.
The first of our results, referring to (1), happens with a probability of 1/4 (25 % chance of getting (1)) and the
other one with 3/4. So calculating the weighted mean (the expected value) among both, we get: 1/4 ∗ 2 + 3/4 ∗
0.415 ≈ 0.811 units of information. This is less than the 1 unit of information obtained by Hartley’s
metric. The difference between 1 and 0.811 is the information that was contained in the probability
distribution that we considered in our accounting.
Now combining the two steps we did here we get the following:
(1/4) ∗ log₂(4) + (3/4) ∗ log₂(4/3) ≈ 0.811
Using the algebraic rules of calculating with logarithms, this can be rewritten as:
(1/4) ∗ log₂((1/4)⁻¹) + (3/4) ∗ log₂((3/4)⁻¹) = {− (1/4) ∗ log₂(1/4)} + {− (3/4) ∗ log₂(3/4)} ≈ 0.811
Note the structure of this formula. We have a sum of two terms. The first refers to the probability of
getting a (1), the second one to the probability of getting a (0). Both terms are negative, and each consists of the
logarithm of the respective probability weighted by that same probability. We just derived
Shannon’s famous formula for entropy, which is a negative expected value (weighted sum) of the
logarithm of probabilities:
Shannon’s entropy H = − ∑ (over all x) p(x) ∗ log p(x)
Now, let us derive it symbolically from the expected value of Hartley’s measure:
Weighted Hartley’s measure = E[log(avg. # of choices)] = E[log(1 / (1 / avg. # of choices))] = E[log(1 / p(x))] = − E[log p(x)] = − ∑ p(x) ∗ log p(x)
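The formula we just derived can be written down directly; a minimal Python sketch (using base-2 logarithms, so the result is in bits):

```python
from math import log2

def shannon_entropy(probs):
    """H = -sum over x of p(x) * log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

# The biased urn from above: p(1) = 1/4, p(0) = 3/4
print(shannon_entropy([0.25, 0.75]))   # ~0.811 bits, less than Hartley's 1 unit
# A fair coin recovers Hartley's measure for two equally likely choices:
print(shannon_entropy([0.5, 0.5]))     # 1.0 bit
```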
Key takeaway: Shannon’s measure of information considers what is already known about
the underlying probability distribution by weighing the average number of symbols needed
to describe choices.
A specialized case of Shannon’s more general entropy formula was already discovered some 75 years
earlier by the physicist Ludwig Boltzmann (1872), who used it to specify the number of possible
microstates of a gas (the number of alternative choices gas molecules have to choose their location from,
so to say). Boltzmann’s formula was engraved on his tombstone after his tragic suicide. It is more specific
because it uses uniform probabilities (like Hartley) and is additionally preceded by a specific constant, the
so-called Boltzmann constant. The generalization of Boltzmann’s formula to non-uniform probabilities is
named after the physicist Gibbs (1873), and it also includes the Boltzmann constant. Shannon adopted the
letter H for his generalized measure for information, with which he most likely paid tribute to Hartley’s
groundwork (humble as he was). Most ironically, physicists working on the basis of the groundwork of
Boltzmann and Gibbs use the letter S for their measure. But this most certainly does not have any
connection to Shannon. Actually, it seems that Shannon did not really make the connection to the previous
work in physics. It was the polymath John von Neumann who pointed it out to him. In Shannon’s words:
“My greatest concern was what to call it. I thought of calling it ‘information’, but the word was overly
used, so I decided to call it ‘uncertainty’. When I discussed it with John von Neumann, he had a better
idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty
function has been used in statistical mechanics under that name, so it already has a name. In the second
place, and more important, nobody knows what entropy really is, so in a debate you will always have the
advantage’” (Tribus and McIrvine, 1971; p. 180). Fortunately, being in the midst of the information age, we
now understand very well what entropy really is. We even just derived it ourselves.
Exercises: We have three kinds of balls: white, shaded and black. If nothing else is known, how
many binary choices are needed to describe the choices?
Answer: ≈ 1.58496…
In this case, is there a difference if you use Hartley’s or Shannon’s measure for information?
Answer: no.
In line with the urn Figure from above, imagine (or draw on paper) an urn that has 6 white balls,
4 shaded balls, and 2 black balls. So the chance to get a white ball is equal to drawing 1 from 2
balls; a shaded ball 1 from 3, and a black ball 1 from 6. Do you know why? Calculate the uncertainty
of drawing a specific color ball in the form of the binary entropy of the entire urn. Answer: ≈ 1.4591…
How much uncertainty was reduced by considering the underlying probability distribution?
Answer: ≈ 0.1258…
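The exercise answers can be verified with the same entropy sketch (repeated here so the snippet stands on its own):

```python
from math import log2

def shannon_entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

uniform = shannon_entropy([1/3, 1/3, 1/3])       # nothing known -> assume equally likely
urn     = shannon_entropy([6/12, 4/12, 2/12])    # 6 white, 4 shaded, 2 black balls
print(uniform)          # ~1.58496 bits
print(urn)              # ~1.4591 bits
print(uniform - urn)    # ~0.1258 bits of uncertainty reduced by knowing the distribution
```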
Shannon (1948) proved that this formula assured that each additional unit of information (e.g. each
additional integer of H = 1, 2, 3,…) provides the same amount of information. This provides the nice feature
of additivity and since most measures we use are additive (length, weight, temperature, etc.), it
reconfirms that we have found an intuitive and very fundamental unit of measurement for information.
As with Hartley’s measure, the choice of the base of the logarithm determines the unit of information.
The two most common cases choose either one out of e = 2.71828183… possibilities (the resulting unit of
information is called ‘nats’ and is the unit of choice for some scientists), or, by far more common, the
fundamental choice of one of two possibilities, which Shannon called ‘bits’. Therefore, when working with
the logarithm of base 2 we measure bits. Shannon attributed this name to John W. Tukey, who had written
a Bell Labs memo on 9 January 1947 in which he abbreviated the term ‘binary digits’ simply to ‘bits’.
Key takeaway: One bit of Shannon information is ‘that which reduces uncertainty by half’.
The importance of compression: driver of the information explosion
Most confusingly the term ‘bit’ is often used to describe two distinct concepts (Hilbert and López,
2012). One refers to the reduction of uncertainty (bits in Shannon’s sense) and the other refers to the
number of binary digits in plain data (e.g. the number of (1)s and (0)s). In practice, Shannon’s term is often
used in telecomm engineering, while the accounting for plain data often refers to hardware questions,
such as from computer science. Somebody who constructs a storage disk or a fiber-optic cable does not
care whether the bits reduce uncertainty, but only about how many (1)s and (0)s can physically be
represented, for example by presence or absence of electrical current on a chip, or light in a fiber optic
cable, or radio waves. Shannon would argue that one could test if one can compress this data and obtain
true information. This means that one would consider what is known about the probability distribution of
the source.
We can use an analogy from Shannon himself to explain how this works in practice. Shannon (1951)
referred to two identical twins who will respond in the same way when faced with the same problem.
They both know ‘the distribution inside the urn’. The receiving twin will know in advance what will be the
most probable piece of information that will be sent by its identical sending twin, since both know the
probability distribution of what the other twin is likely to say next (after all, they are twins). Therefore,
if the next piece of information is the most probable one, the sending twin will not even bother to send
it, and allow the receiving twin to take the most probable scenario for granted. This ends up in a dynamic
whereby the sending twin will only send a symbol if its content constitutes a surprise for its receiving peer,
if it is something unexpected. Sending a symbol that is to be expected does not reduce uncertainty. If both
twins have access to the same probability distribution about the source, it is clear what is to be expected.
What Shannon’s derivations show is that this is the most efficient way to communicate, that is, it requires
the least number of symbols to be sent (for an optional hands-on exercise, see below).
In the digital information age Shannon’s twins are an essential part of all kinds of information and
communication processes and have names like RAR, ZIP, GIF, JPEG, GSM, CDMA, MP3, AVI and MPEG,
among many others. On the sending end, one of them takes the message and eliminates all redundancy
by considering the probability distribution, while the other twin on the receiving end reconstructs the
same message extremely efficiently on the basis of the same probability distribution. In many practical
applications the probability distribution of the source does not even need to be known beforehand. In the
field of so-called universal coding twins on both ends learn the probability distribution of the incoming
symbols on the go while new messages are discovered (Cover & Thomas, 2006). For example, the well-known Lempel-Ziv universal code encodes incoming series of symbols on the basis of previously seen
symbols. So even if the probability of a blue pixel in the movie you are watching is not known initially,
compression software can learn it very quickly and therefore compress data into information on the go.
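Python's zlib module is one such everyday ‘twin’: its DEFLATE format combines a Lempel-Ziv (LZ77) dictionary coder with Huffman coding and learns the repetitions in the incoming data on the go. A small illustration (the byte counts are machine-dependent and only meant to show the contrast):

```python
import os
import zlib

structured = b"blue pixel " * 1000           # highly redundant, 'image-like' data
random_data = os.urandom(len(structured))    # already looks like pure information

print(len(structured), "redundant bytes ->", len(zlib.compress(structured, 9)), "compressed bytes")
print(len(random_data), "random bytes    ->", len(zlib.compress(random_data, 9)), "compressed bytes")
# The redundant data collapses to a tiny fraction; the random data stays roughly its original size.
```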
Compression is very important in the digital age because it allows us to store and communicate more
information with the same amount of hardware. Compression methods have improved considerably over
recent decades (López and Hilbert, 2012; Hilbert, forthcoming). Back in the mid-1980s, compression
software could compress alphanumeric text symbols on average by a ratio of 2:1. That means that if a
piece of hardware has the capacity for 1,024 binary data symbols of (1)s and (0)s, compression software
could then use it to store or communicate twice as much information. Thirty years later text can be
compressed by a factor of 5:1. Sound is even more redundant and can be compressed by a factor of 20:1,
images by a factor of 24:1. In an image, it is for example redundant if a whole lot of blue pixels fill out the
upper half of an image. Enumerating all these millions of pixels in a high definition image does not reduce
uncertainty. One might as well use a short code that says something like: ‘upper half all blue pixels’.
Moving images (video) have the highest redundancy, as they display redundancy in both space (like images)
and in time. For example, when nothing changes on your television screen, no new uncertainty is reduced.
Much in line with Shannon’s twins, this part of the image does not need to be updated, and it makes sense
to tell the receiving ‘twin’ to display the same image until further incoming surprises. While video was not
compressed in the mid-1980s, using modern compression methods allows videos to be compressed by a
factor of 85:1 in the mid-2010s. This allows us to handle 85 times more information (reduce 85 times more
uncertainty) with the same number of hardware data symbols. During the global information explosion,
all of these kinds of contents have grown at different rates (Hilbert, 2014b).
The average contribution of this kind of technological progress to the global information explosion in
the digital age is outstanding. During the two decades from 1986 to 2007, in which digitalization
basically took place, it was the most important driver in the global growth of information. There are three
different drivers of more information: the amount of technological infrastructure, its hardware capacity
and software compression. As an equivalent for information storage you can imagine buckets to store
stones. As an equivalent for information communication you can imagine tubes that transmit water drops.
The number of technological devices that make up the infrastructure is equivalent to the number of
buckets or tubes; the hardware capacity of the storage and telecommunication devices to the size of the
buckets or tubes (i.e. their thickness); the software compression would be the analogous equivalent to
the size or granularity of the content. So we have three complementary ways to store more stones or
transport more water drops: install more buckets and tubes; increase their diameter; or decrease the
average size of the content in terms of having smaller stones or drops.
The Figure shows the average annual contribution of each in terms of the compound annual growth rate
between 1986 and 2007 (Hilbert, 2014a). On the left hand side it shows the drivers behind the growth of
the total amount of information the world could store in all of its technological devices (from books, over
video cassettes and hard disks, to credit cards); and on the right hand side for the total amount of
information the world could telecommunicate in all of its technological devices (from postal letters, over
phones, to broadband internet). In the case of storage, the installation of more storage infrastructure has
contributed on average a 5 % increase in information growth. This is similar to the average annual growth
of the global economy during this period. Moore’s famous law (Moore, 1995), which assured technological
progress in hardware during this period, contributed an average of 8 % per year.¹ The most important
contribution, however, stems from information compression, which contributed an average 11 % to the
global growth of information. Something very similar accounts for the case of telecommunication of
information (see right hand side of Figure). The resulting growth rates of 25 % or 28 % per year are five to
six times faster than the average growth of our economies.
Note: as outlined in Hilbert (2014) the total growth of information is the product of the growth factors of the three
components of growth. For the case of storage: 1.05 ∗ 1.08 ∗ 1.11 ≈ 1.25, and for the case of telecommunication
1.08 ∗ 1.07 ∗ 1.104 ≈ 1.28.
¹ The reason why global storage hardware capacity only doubles every 9 years (1.08⁹ ≈ 2), and not every 1-3 years (as
predicted by Moore’s law), is first of all that storage is not completely equal to computation (the world’s computation capacity
has grown almost twice as fast) (Hilbert and López, 2011), and second because of technology replacement (diffusion), which
does not keep up with technological progress. Not all storage devices that are used in a given year are from the current
technological frontier. This leads to a gradual replacement among different shares of devices from different years and
performance levels.
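The decomposition in the note is simple compound-growth arithmetic; a quick check with the figures reported above:

```python
from math import log

# Storage: the three drivers multiply into the total annual growth factor.
infrastructure, hardware, compression = 1.05, 1.08, 1.11
print(infrastructure * hardware * compression)   # ~1.26, i.e. roughly 25 % growth per year

# Doubling time of hardware capacity at 8 % annual growth:
print(log(2) / log(1.08))                        # ~9 years, since 1.08**9 is about 2
```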
This shows us that Shannon’s contribution not only conceptualized the fundamental unit of the digital
age (the bit), but also led to the main driver of the global information explosion in the digital age.
Information theory allows us to distill true uncertainty-reducing information from mere redundant data
symbols by taking out everything redundant that is already known. This is an impressive example of the
power of theory. By playing around with the expected value of logarithmic probabilities—just as we did
above, following in his footsteps—Shannon’s insights unleashed a digital avalanche that flooded the world
with more information than it can currently process.
How to convert data into information: coding
The key to converting mere redundant data into truly uncertainty-reducing information consists in
coding. Shannon’s entropy formula is a theoretic benchmark. With his “source coding theorem” (Shannon,
1948) he showed that it is possible to compress data to the size of the entropy of the source on average. This
shows that something is theoretically possible without telling us how to do it. It is nevertheless the alpha
and omega of every engineering endeavor, because it sets engineers on the challenging journey that the
influential philosopher of science Thomas Kuhn (2012) almost condescendingly calls “puzzle solving”. To
say it in the words of the Roman philosopher Seneca: “If one does not know to which port one is sailing,
no wind is favorable”. For most practical purposes, the so-called “Shannon limit” was not achievable until
the discovery of “Turbo-codes” in the mid-1990s (Berrou et al., 1993), which is almost half a century after
Shannon showed us this benchmark.
Key takeaway: Shannon’s source coding theorem tells us that it is possible to find a code that
compresses data to the size of the entropy of the source on average, but not beyond.
Shannon defined one bit of information as ‘that which reduces uncertainty by half’. Turning this insight
around we can now create true bits by asking questions that reduce uncertainty by half each time. If the
space is uniformly distributed, we can simply divide the number of choices in half, just as we did in the
Figure above with Shannon’s picture when we identified the letter C with five questions. If choices are not
uniformly likely we have to find a way to reduce uncertainty by half.
Consider the outcome of a sports game. There are regularly three possible outcomes for a team:
o (w) (win); let us assume this happens with probability: p(w) = 0.25
o (t) (tie); let us assume this happens half of the time: p(t) = 0.5
o (l) (lose); let us assume this happens with probability: p(l) = 0.25
How many binary symbols do we need to communicate this result to our friend? One naïve approach
would be to use two symbols (which follows the logic of a so-called fixed-length code, fixed to two
symbols). One is not enough, but as shown in the upper Figure, two is too much, as we have one branch
of our coding tree unused. In this arbitrarily assigned choice of code words we do not use the code word
(00). With this code, we will need a total of 16 symbols to keep our friend up to date over the entire
tournament of 8 games. Using Shannon’s insight, we can now approach the probability space by using
each symbol to reduce uncertainty by half. If we join winning and losing, we have a 50-50 choice between
the game either resulting in a tie (p(t) = 0.5) or not. Resolving this question communicates one bit of
information. How often will it be enough to ask this question? Well, half of the time. So half of the time
we only need one symbol, not two. The other half of the time we need two symbols: a quarter of the time
we need two symbols to communicate winning, and another quarter of the time when losing. One way of
coding this is shown in the lower Figure. So on average we get the following expected value of the
required number of symbols: 0.5 ∗ 1 + 0.25 ∗ 2 + 0.25 ∗ 2 = 1.5. This is the same result we get with Shannon’s
entropy formula, which tells us that this code is optimal.
H(Game) = − ∑ p(g) ∗ log p(g) = − (1/2 ∗ log₂(1/2) + 1/4 ∗ log₂(1/4) + 1/4 ∗ log₂(1/4)) = 1.5
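A short numerical check of this example; the code-word lengths 1, 2, 2 correspond to the variable-length code sketched in the lower Figure:

```python
from math import log2

p = {"tie": 0.5, "win": 0.25, "lose": 0.25}
code_length = {"tie": 1, "win": 2, "lose": 2}    # symbols per outcome in the variable-length code

expected_length = sum(p[x] * code_length[x] for x in p)
entropy = -sum(p[x] * log2(p[x]) for x in p)
print(expected_length, entropy)                  # both 1.5: this code reaches Shannon's limit
```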
Imagine the outcome of the eight games of the tournament is the following:
Game:               1    2    3    4    5    6    7    8
Outcome:            (w)  (t)  (l)  (w)  (l)  (t)  (t)  (t)
Fixed-length code:  11   10   01   11   01   10   10   10   (sum of symbols: 16)
Optimal code:       01   1    00   01   00   1    1    1    (sum of symbols: 12)

On average we required only 12/8 = 1.5 symbols to communicate all results with the optimal code. This
is a crucial insight. Every time Shannon’s formula gives us a result that is not an integer, we can only fully
take advantage of it on average. The number 1.5 is an average number over all possible outcomes. If you
ask how many bits we need to communicate one specific result, it will depend on that result. Even with
an optimal code we will need 2 bits to communicate a win. The fact that we require many events makes
sense when remembering that Shannon’s formula works with a weighted average. Any efficiency gain that
we might obtain from converting data into information only works on average over several events, while
for one single event, we might be just as inefficient as if we simply ignored what we know about
the underlying probability structure of the events.
Key takeaway: Compression of data into information makes use of the average behavior of
the sequence of symbols, which is described by their probabilities of occurrence.
Exercise: The following code tree presents a so-called prefix code, or instantaneous code,
because no codeword is a prefix of any other and you can decode it instantaneously. What
letters are represented by the following sequence:
011101101001110001101001110 Answer: DECODE_CODE
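Instantaneous decoding can be sketched in a few lines of Python. The exact code-word assignment is only given in the Figure; the codebook below is one assignment that is consistent with the stated answer and is used purely for illustration:

```python
# Hypothetical codebook consistent with the exercise (the actual assignment is in the Figure):
codebook = {"00": "_", "010": "O", "011": "D", "10": "E", "11": "C"}

def decode(bits, codebook):
    """Read the bits left to right and emit a letter as soon as a code word matches.
    Because no code word is a prefix of another, each match is unambiguous (instantaneous)."""
    output, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in codebook:
            output.append(codebook[buffer])
            buffer = ""
    return "".join(output)

print(decode("011101101001110001101001110", codebook))   # DECODE_CODE
```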
The logic that we used for our optimal code was generalized by a student who looked for an easy way
out of taking his final exam. In 1951, David Huffman took the offer of his Professor (a colleague of
Shannon, Robert Fano) to work on a homework problem on finding the smallest code instead of taking
the final. What the Professor did not mention was that no one at the time knew the optimal solution. The
state of the art was Fano’s own suboptimal Shannon-Fano code. After a while, Huffman was about to
switch strategies by preparing for the test and throwing away his scratchings on the problem. “As one of
the papers hit the trash can, the algorithm came to him” (Huffman, 2010), with the result that he outdid
the well-known method of his Professor (Huffman, 1952).
To begin with, Huffman coding makes use of a coding practice that has long been known: use longer
symbols for less likely messages. This way, the longer codes show up less often and the average code
length stays small. Around 1838 Samuel Morse made use of this idea and walked into the printing press
of the local newspapers to see which letters of the metal typeset looked most used. He found that ‘e’
looked much worn, so he assigned it the single ‘dot’ in his famous code of dots and dashes, while he
assigned the lengthy code ‘dot-dash-dot-dot’ to the rather untouched letter ‘x’. Huffman uses the same
logic and finds the optimal mix between likelihood and code length by always combining the two least
likely symbols into one symbol until we are finally left with only one symbol. At each merger, the
probabilities of the merged options are added and this new probability is assigned to the merged option.
For example, the Figure distinguishes among five different kinds of weather. The two least likely involve
sunshine (each with 15 % likelihood). The first step merges them, creating a new
symbol with probability 0.3. So we are left with four options, of which the two least likely include the rainy
one with probability 0.2 and another one with 0.25. This will then be our second merger, leaving us a new
merged option with probability 0.45. We are now left with three options (0.3, 0.45 and the single cloud
with 0.25). We merge the two least likely to create a new option with 0.55. Now we are left with only two
options and merging them leaves us with only one symbol. Assigning binary code to the grown branches
provides the final code. Calculating the average code length, we see that the three-symbol code for
sunshine (000) can be expected to appear 15 % of the time, etc., resulting in an average code length of
0.15 ∗ 3 + 0.15 ∗ 3 + 0.25 ∗ 2 + 0.25 ∗ 2 + 0.2 ∗ 2 = 2.3. The entropy of this weather source is:
H(Weather) = − ∑ p(w) ∗ log p(w) ≈ 2.285…, which shows that our code almost reaches
Shannon’s limit. Unfortunately, since Huffman’s code is optimal, that is as good as it gets.
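The merging procedure described above can be sketched compactly with a priority queue. The weather labels are guesses at the Figure's icons, and the resulting 0/1 assignments may differ from the Figure's, but the code-word lengths and the average length of 2.3 come out the same:

```python
import heapq
from math import log2

def huffman_code(probs):
    """Build a binary Huffman code for {symbol: probability} by repeatedly
    merging the two least likely options, as described in the text."""
    # Heap entries: (probability, tie-breaker, {symbol: partial code word})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)          # the two least likely options ...
        p2, _, group2 = heapq.heappop(heap)
        for sym in group1:
            group1[sym] = "0" + group1[sym]          # ... receive a 0 and a 1 in front ...
        for sym in group2:
            group2[sym] = "1" + group2[sym]
        heapq.heappush(heap, (p1 + p2, counter, {**group1, **group2}))   # ... and merge into one option
        counter += 1
    return heap[0][2]

weather = {"sun": 0.15, "sun+cloud": 0.15, "rain": 0.2, "cloud": 0.25, "cloud+rain": 0.25}
code = huffman_code(weather)
avg_length = sum(weather[s] * len(code[s]) for s in weather)
entropy = -sum(p * log2(p) for p in weather.values())
print(code)
print(round(avg_length, 3), round(entropy, 3))   # 2.3 vs ~2.285
```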
Exercise: A retailer studies the shopping behavior of its customers and classifies them into
four different kinds, those that buy a lot, or just a couple of items, or just one item, or nothing.
During a representative hour of their business they observe the following time series of
purchases. What is the probability of each of the four behaviors based on this representative
sample? In order to communicate these different shopping behaviors most effectively, how many
bits are needed in theory – that is, what is the entropy of the source in bits; how often does
uncertainty get reduced on average? Use Huffman coding to see if it is possible to come up with a
code that compresses down to the entropy of this shopping source. Use your Huffman code to
communicate the series of eight shopping purchases above to somebody else. What is the code?
How many symbols does this code series have to represent how many events?
Answers: 0.5, 0.25, 0.125, 0.125 // 1.75 // it is possible // depending on how you used (1)s
and (0)s either 01001110001011 or 10110001110100 // 14 symbols to represent 8 events, or 14/8
= 1.75 symbols per event.
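A quick check of the exercise: the code-word lengths 1, 2, 3, 3 are what the Huffman construction sketched above assigns to these probabilities, and because the probabilities are powers of 1/2, the code hits the entropy exactly.

```python
from math import log2

probs   = [0.5, 0.25, 0.125, 0.125]   # the four shopping behaviors, in some order (from the exercise answer)
lengths = [1, 2, 3, 3]                # Huffman code-word lengths for these probabilities

entropy = -sum(p * log2(p) for p in probs)
avg_len = sum(p * l for p, l in zip(probs, lengths))
print(entropy, avg_len)               # both 1.75 bits; 8 events * 1.75 = 14 symbols
```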
One very intriguing result of optimal compression is that the resulting code looks completely random.
For example, count the number of (1)s and (0)s from the example of communicating the results of eight
games with an optimal code. According to the Table we have exactly six (0)s and six (1)s. This is no
coincidence and must happen over long series of optimal codes on average (deviations can happen
because the code is not optimal or because the sample is not representative of the true probabilities). The
simple argument is that if it were not, if there were more (0)s than (1)s in the series (or vice versa), we could create
a code on a higher level that would compress this further. For example, if it turns out that we end up
with p(1) = 0.25 and p(0) = 0.75 (such as in our example of drawing balls from an urn from the Figure
above), we know that we can exploit this structure and, based on this insight, compress it further down (in
this case to some 0.811 bits instead of 1 bit, as we did above). Only if the number of (0)s and (1)s is
uniform is there no compression left to be done. In short, the output of a perfect compressor is
perfectly random bits.
On the one hand, cryptographers make use of this when creating secret codes. Good codes look
perfectly random. Cryptography is intimately linked to information theory, and most of the early
information theorists (including Shannon) worked on secret codes (especially during the times
of World War II). On the other hand, this also means that every time we see something truly random,
truly mixed up and chaotic, be it in nature or society, it might be that it is nothing else than a perfectly
compressed code of something or somebody…(?)
From information to communication: channels
Shannon (1948) himself called his theory the “mathematical theory of communication”, and
“communication theory” was also the dominant term for many decades (e.g. Pierce, 1980), until it was
replaced by the term “information theory”. Both apply, because Shannon solved two fundamental
problems in this 1948 work. The previous sections discussed the one related to information, namely, what
the purest form of information is and how to quantify it (this refers to Shannon’s “source coding theorem”).
The second is about the maximum rate of information that can be communicated over a channel (this refers
to Shannon’s “noisy channel coding theorem”). Thinking about an analogy from transport, the first asks
how to best measure what is to be transported: should we focus on weight, volume, color, smell,
or a combination thereof, and what is the purest measure we should use to quantify what matters? The
second asks what the maximum achievable rate of transport is given the physical constraints of our
reality: is there a limit and is it achievable? Shannon answered both of these latter questions with yes.
To get started, let us follow the leading information theory textbook and ask “What do we mean
when we say that A communicates with B? We mean that the physical acts of A have induced a desired
physical state in B… The communication is successful if the receiver B and the transmitter A agree on what
was sent” (Cover & Thomas, 2006; p. 183). Another way of asking the same question is to ask how much
mutual understanding there is between the communicating parties. For example, if I tell you that I
imagine a house on the beach, the question of successful communication can be quantified by somehow
measuring how much the house on the beach that I have on my mind has in common with the house on
the beach that is now on your mind. In line with this reasoning, Shannon’s answer to the quantity of
information communicated is that it is exactly equivalent to the amount of uncertainty that got reduced
between them. If I tell you that I actually imagine a one-story bungalow with a balcony on the beach during
sunset, we probably get closer. Therefore, information is the opposite of uncertainty, and communication
is the process of uncertainty reduction. The amount of common ground arising from the communication
is quantified by the so-called ‘mutual information’, the information that is mutual between the
communicating parties. It consists of the aspects of the house on the beach on my mind that are also present
in the house on the beach on your mind. Shannon himself never used the term mutual information and
merely talked about the reduction of uncertainty.
Key takeaway: Information is the opposite of uncertainty, and communication is the
process of uncertainty reduction.
In information theory this is often depicted with the help of a Venn-diagram, such as in the Figure
(Yeung, 1991; James et al., 2011). The mutual information is the shared area of the two circles that
represent the sender and receiver. Shannon’s reasoning is that the mutual information between both is
the uncertainty in the receiver, minus the uncertainty that remains in the receiver after the
communication. Since the arising common ground is supposed to be mutual, it has to be the same for the
sender and the receiver, and we can turn this around: the mutual information between both is the
uncertainty in the sender, minus the uncertainty left in the sender that is not shared with the receiver. It
is arbitrary whether we say that the one describes the other or the other way around, which is the reason
why information theorists after Shannon added the word “mutual” to information. The technical term is
that mutual information is a symmetric measure: it is the same, independent of which side it is looked at
from. The Figure presents the case where the mutual information between S and R is calculated as the
uncertainty of R, minus the remaining uncertainty of R when knowing S (see Figure S.5). In almost all cases
(technological or human) there is no complete match between the sent and received concept; it is very
difficult to eliminate all uncertainty. Mutual information is measured in the same units as entropy (bits),
where each bit reduces existing uncertainty by half.
Exercise: Draw the three Venn-diagrams from the other perspective, that is, the mutual
information between both (their intersection) as the uncertainty in the sender, minus the
remaining uncertainty in the sender that is not shared with the receiver.
Answer: mirror the three diagrams on a vertical axis.
For example, imagine a communication between an information theorist (sender) and a layperson
(receiver). The layperson asks the following question: “So you are telling me that information is something
quantifiable, just like weight, temperature, or length?” Most of the time the information theorist is patient
and answers with the truth, with a candid “yes sure!”. However, once a week (typically Fridays) the
information theorist becomes tired of this question and answers with a sarcastic “no, of course not, what
are you thinking!” In this case, the receivers are very bad at picking up sarcasm (who would expect a
thoughtful theorist to be sarcastic on such a serious issue…), so given that sarcasm is the case, there is
merely a 50%-50% chance that it is understood as such. When the theorist gives a candid true answer, it
is more often than not interpreted as such, but not always (in 5 out of 8 cases, or 62.5% of the time). The
Figure presents this communication channel schematically. The left side presents the two possible choices
for the sender, the right side the two possible interpretations of the receiver. In between we have the
options that the message is understood correctly (the horizontal lines indicate that sent truth is received
as sent and sent sarcasm is received as sarcasm), and the diagonal cross lines that indicate
misunderstanding (sent truth is received as sarcasm and sent sarcasm is received as truth). These diagonal
misunderstandings are typically referred to as ‘noise’ by engineers.
The presented input is sufficient to infer several other values. For example, the combination of the
fact that the sender speaks the truth 4 out of 5 times (𝑃(𝑆𝑇) = 0.8) with the fact that, given the sending
of a true response, it is understood as such in 5 out of 8 cases (𝑃(𝑅𝑇|𝑆𝑇) = 0.625) allows us to calculate the
joint probability of sending and receiving a true message (𝑃(𝑅𝑇, 𝑆𝑇)). This is done according to the
well-known definition of what a conditional probability actually is:

𝑃(𝑅𝑇|𝑆𝑇) = 𝑃(𝑅𝑇, 𝑆𝑇) / 𝑃(𝑆𝑇);   or:   𝑃(𝑆𝑇) ∗ 𝑃(𝑅𝑇|𝑆𝑇) = 𝑃(𝑅𝑇, 𝑆𝑇)

In our case: 0.8 ∗ 0.625 = 0.5. We can do the same for the joint probability of sent sarcasm and received
truth. In this case the sender is sarcastic, but the receiver does not understand it as such and takes the
answer at face value. In this case: 𝑃(𝑆𝑆) ∗ 𝑃(𝑅𝑇|𝑆𝑆) = 0.2 ∗ 0.5 = 0.1 = 𝑃(𝑅𝑇, 𝑆𝑆). According to the basic
laws of probability, the involved joint probabilities sum up to the marginal probabilities (see the tables
below and compare all ingredients of the tables with the numbers in the channel Figure). This allows us to
conclude that the receiver takes the answer to be true 60% of the time. The channel reveals that this is
the result of a truthful answer being understood correctly (50% of all communications) and of a sarcastic
answer being misinterpreted (10% of all communications). This also reveals that while the information
theorist is only sarcastic 1 out of 5 times, the typical receiver understands the answer to be sarcastic
2 out of 5 times (𝑃(𝑅𝑆) = 0.4).
conditioned on sender:
        𝑅𝑇       𝑅𝑆       sum
𝑆𝑇      0.625    0.375    1
𝑆𝑆      0.5      0.5      1

joint between sender and receiver:
        𝑅𝑇       𝑅𝑆       sum
𝑆𝑇      0.5      0.3      0.8
𝑆𝑆      0.1      0.1      0.2
sum     0.6      0.4

conditioned on receiver:
        𝑅𝑇       𝑅𝑆
𝑆𝑇      0.833    0.75
𝑆𝑆      0.167    0.25
sum     1        1
Equipped with the table of joint probabilities, we can now also calculate the probabilities
conditioned on the receiver. From the perspective of the receiver, received sarcasm actually is a truthful
answer 3 out of 4 times (𝑃(𝑆𝑇|𝑅𝑆) = 0.75). In information theory, the resulting conditional probabilities
are known as the equivocation and taken as a “natural measure of the information lost in transmission”
(Pierce, 1980; p. 154).
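How these numbers follow from the channel's givens can be traced in a few lines of Python. This minimal sketch assumes nothing beyond the three numbers from the Figure (𝑃(𝑆𝑇) = 0.8, 𝑃(𝑅𝑇|𝑆𝑇) = 0.625, 𝑃(𝑅𝑇|𝑆𝑆) = 0.5); the dictionary names are arbitrary:

p_S = {"T": 0.8, "S": 0.2}                       # sender: truth vs. sarcasm
p_R_given_S = {"T": {"T": 0.625, "S": 0.375},    # receiver's interpretation given sent truth
               "S": {"T": 0.5,   "S": 0.5}}      # ... given sent sarcasm

# joint distribution: P(r, s) = P(s) * P(r|s)
joint = {(s, r): p_S[s] * p_R_given_S[s][r] for s in "TS" for r in "TS"}

# marginal of the receiver: P(r) = sum over s of P(r, s)
p_R = {r: joint[("T", r)] + joint[("S", r)] for r in "TS"}

# conditioned on the receiver: P(s|r) = P(r, s) / P(r)
p_S_given_R = {(s, r): joint[(s, r)] / p_R[r] for s in "TS" for r in "TS"}

print(joint)        # {('T','T'): 0.5, ('T','S'): 0.3, ('S','T'): 0.1, ('S','S'): 0.1}
print(p_R)          # {'T': 0.6, 'S': 0.4}
print(p_S_given_R)  # e.g. P(S_T|R_S) = 0.75 and P(S_T|R_T) ≈ 0.833, as in the tables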
So what is the amount of information communicated over this channel? In line with the Venn-diagram
figure from above, Shannon’s answer is that it is the uncertainty in the receiver, minus the uncertainty
that remains in the receiver after the communication. The uncertainty in the receiver can be calculated
with the entropy of the receiver: 𝐻(𝑅) = − ∑ 𝑝(𝑟) ∗ log 𝑝(𝑟) = −(0.6 ∗ log2 0.6 + 0.4 ∗ log2 0.4) ≈
0.971. The remaining uncertainty of the receiver is the weighted uncertainty conditioned on the sender,
which is the entropy of each conditioning case, weighted by the weight of this case (its respective
marginal). So we calculate the entropy for each condition of the sender: 𝐻(𝑅|𝑠𝑇) = −(0.625 ∗
log2 0.625 + 0.375 ∗ log2 0.375) ≈ 0.954 and 𝐻(𝑅|𝑠𝑆) = −(0.5 ∗ log2 0.5 + 0.5 ∗ log2 0.5) = 1, and
weigh them by their occurrence, which gives: 0.8 ∗ 𝐻(𝑅|𝑠𝑇) + 0.2 ∗ 𝐻(𝑅|𝑠𝑆) ≈ 0.964. In general, the
so-called conditional entropy is defined as:
𝐻(𝑅|𝑆) = − ∑_{over all s} 𝑝(𝑠) ∗ ∑_{over all r} 𝑝(𝑟|𝑠) ∗ log 𝑝(𝑟|𝑠) = − ∑_{over all s and r} 𝑝(𝑟, 𝑠) ∗ log 𝑝(𝑟|𝑠) = −𝐸_{s and r}[log 𝑝(𝑟|𝑠)]
Key takeaway: The conditional entropy is the remaining uncertainty after the probabilities
of the conditioning variable are known.
In line with the visual presentation in the Venn-diagram figure, we now calculate the mutual
information between both as: 𝐼(𝑆; 𝑅) = 𝐻(𝑅) − 𝐻(𝑅|𝑆) ≈ 0.971 − 0.964 ≈ 0.0074.
Exercise: Calculate the mutual information from the other perspective of the channel, that is:
𝐼(𝑆; 𝑅) = 𝐼(𝑅; 𝑆) = 𝐻(𝑆) − 𝐻(𝑆|𝑅). What values do 𝐻(𝑆) and 𝐻(𝑆|𝑅) have in bits?
Answer: 𝐻(𝑆) ≈ 0.722 and 𝐻(𝑆|𝑅) ≈ 0.715.
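The following minimal sketch recomputes these entropies and the mutual information from the joint distribution derived above, including the two values asked for in the exercise; the helper function names are arbitrary:

from math import log2

joint = {("T", "T"): 0.5, ("T", "S"): 0.3, ("S", "T"): 0.1, ("S", "S"): 0.1}  # (sender, receiver)
p_S = {"T": 0.8, "S": 0.2}   # marginal of the sender
p_R = {"T": 0.6, "S": 0.4}   # marginal of the receiver

def H(dist):
    # Shannon entropy (in bits) of a distribution given as a dict of probabilities
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def H_cond(joint, marginal, condition_on):
    # conditional entropy: remaining uncertainty once the conditioning side is known
    total = 0.0
    for (s, r), p_sr in joint.items():
        y = s if condition_on == "sender" else r
        total -= p_sr * log2(p_sr / marginal[y])
    return total

print(H(p_R))                                   # H(R)   ≈ 0.971
print(H_cond(joint, p_S, "sender"))             # H(R|S) ≈ 0.964
print(H(p_R) - H_cond(joint, p_S, "sender"))    # I(S;R) ≈ 0.0074
print(H(p_S))                                   # H(S)   ≈ 0.722  (exercise)
print(H_cond(joint, p_R, "receiver"))           # H(S|R) ≈ 0.715  (exercise)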
There is another way to understand mutual information. It is conceptually similar to the well-known
concept of covariance, which is at the heart of all correlation and regression work in statistics.
When we say that something correlates with something else, we quantify this by measuring the difference
between (the conceptual equivalent of) the joint and the independent events. If there is not much
difference, we say that there is not much correlation between both. If there is no difference, the events
are independent, which means that one does not give us any indication about the other. Mutual
information follows the same logic, and one can reasonably argue that it is actually a more fundamental
version of what covariance does. Since the difference (between joint and independent) is among
logarithms, it turns into a ratio, and the mutual information consists of its weighted average:
𝐼(𝑆; 𝑅) = ∑_{over all s,r} 𝑝(𝑟, 𝑠) ∗ {log 𝑝(𝑟, 𝑠) − log(𝑝(𝑟) ∗ 𝑝(𝑠))} = ∑_{over all s,r} 𝑝(𝑟, 𝑠) ∗ log [𝑝(𝑟, 𝑠) / (𝑝(𝑟) ∗ 𝑝(𝑠))]
Exercise: Calculate the mutual information from this formula directly.
Key takeaway: There are two basic ways to understand mutual information:
* as the difference between total and conditional uncertainties
* as the difference between joint and independent distributions
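For the exercise, a direct translation of this formula into Python (using the same assumed joint and marginal dictionaries as before) again yields roughly 0.0074 bits:

from math import log2

joint = {("T", "T"): 0.5, ("T", "S"): 0.3, ("S", "T"): 0.1, ("S", "S"): 0.1}
p_S = {"T": 0.8, "S": 0.2}
p_R = {"T": 0.6, "S": 0.4}

I = sum(p_sr * log2(p_sr / (p_S[s] * p_R[r])) for (s, r), p_sr in joint.items())
print(I)   # ≈ 0.0074 bits, the same value as H(R) - H(R|S) above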
The importance of the metric of mutual information is underlined by Shannon’s noisy channel coding
theorem. It establishes that for any given degree of noise of a communication channel (these are the
misunderstandings or cross-overs in the channel), it is possible to communicate discrete data error-free up
to the so-called channel capacity, which is the maximum mutual information. This establishes a theoretical
maximum information transfer rate of a channel for a particular noise level. So while Shannon’s source
coding theorem shows a limit to how small information can be made (how much it can be compressed
without loss of information), Shannon’s noisy channel coding theorem shows how much can maximally
be transported over a channel. Continuing with our analogy of transport, the one tells us how to measure
the content that is to be transported and therefore indicates methods to reduce it to what really matters,
while the other tells us how much can possibly be transported given the conditions that we have. These
are the two most fundamental aspects of the issue involved: information (entropy) and communication
(mutual information).
Key takeaway: Shannon’s channel coding theorem tells us that it is possible to
communicate error-free over a noisy channel up to the maximum of the mutual information
of the channel.
Now you might ask yourself: how can we use this logic to understand what is going on when I
communicate my mental image of a house on the beach? Here it is important to remember that
information theory works with probabilities, and those only arise on average over many instances. So the
sarcasm channel depicted above describes the average weekly communication between the information
theorist and many receivers. A respective channel of a house on the beach might have several variables
(like sunset or not, one-story bungalow or multiple stories, balcony or not), and any arising communication
channel could only be described on average over many communications. This is a very important point to
remember:
Key takeaway: Information theory is a branch of probability theory and therefore works
with probabilities, which often represent averages over many instances or, alternatively,
average beliefs.
But there is another, maybe more intuitive way we can look at information. “The most natural
approach to defining the quantity of information is clearly to define it in relation to the individual object”
(Li and Vitanyi, 2008; p. 101). This aims not at the average, but at the specific case. In terms of our beach
house analogy, it focuses on a literal bit-by-bit description of the specific house in question, not on the
average case. It contains knowledge to describe the house in a step-by-step procedure.
Knowledge: algorithmic descriptions
The branch of philosophy concerned with the nature and scope of knowledge is called epistemology.
In applied epistemology the relation of information and knowledge is often described as “information that
have been organized” (Rowley, 2007; p. 172) or “as actionable information” (p. 175). While “information
is to be interpreted as factual… knowledge, on the other hand, establishes generalizations and
correlations between variables. Knowledge is, therefore, a correlational structure” (Saviotti, 1998; p. 845).
This suggests that knowledge consists of an interlinked network of information. From an individual
perspective, psychologists argue that behavioral patterns contain knowledge about the world that allow
us to make relevant decisions (Tversky and Kahneman, 1974); from a social perspective, economists argue
that a firm’s organizational knowledge is stored in its routines (Nelson and Winter, 1985), sociologists
argue that it is embedded in institutionalized mechanisms, procedures and habits (Powell and DiMaggio,
1991), and anthropologists claim that cultural norms represent guidelines for behavior consisting of
accumulated knowledge (Boyd and Richerson, 2005).
The branch of science concerned with the nature and scope of knowledge is called computer science.
In computer science, the study of concepts like actionable information, patterns, routines, guidelines or
networked information belongs to the study of algorithms. In essence, an algorithm is a step-by-step
recipe for doing something. Formally, it “is an ordered set of unambiguous, executable steps that defines
a terminating process” (Brookshear, 2009; p. 205). In plain English, it is a feasible recipe that will not send
you into an endless loop where you will go in circles forever. Our world is full of such recipes. Any kind of
formal manual and instruction, any kind of regulation or law, but also any kind of customary habit falls
into this category. The basic characteristic is that a process or structure is described. The characteristic
trademark of algorithms is that they are deterministic. They give clear instructions. If there is knowledge
on how to describe something, there is no doubt: there is a deterministic process or procedure that
defines how things go from one step to the next. This is often implemented by a conditional logic of steps
based on specific cases, such as “if the crust is brown, take it out”, “if it is Mo-Sat 8am, then park here”, or
“if the other team attacks, then defend”. Computational algorithms often follow the
same deterministic “if-then” logic.
Key takeaway: An algorithm is an ordered set of unambiguous, executable steps that
describe a process or object.
Theoretically, the right measure to quantify the amount of knowledge contained in a deterministic
description has been established by Andrey Kolmogorov (1941; 1968), and independently developed by
Solomonoff (1964) and Chaitin (1966). It is known as “Kolmogorov complexity” (Li and Vitanyi, 2008), or
any combination of the foregoing three names (Crutchfield, 2012). The basic idea “has been established
by von Mises in a spirited manner” (Kolmogorov, 1963, p. 369) in the early 1900s. The mathematician Richard
von Mises worked on a definition of information as the opposite of randomness, while defining
randomness according to his “Regellosigkeitsprinzip” or the “lack of regularity principle”. While
pioneering, von Mises’ definitions turned out to be flawed and incomplete (Martin-Löf, 1966). Kolmogorov
clarified this notion in 1941 (in Russian and German), but it took until the 1960s for Kolmogorov’s ideas
to get around the iron curtain of the Cold War (1968), a time during which Solomonoff (1960, 1964a, 1964b)
and Chaitin (1966) worked on similar concepts independently. The resulting three different
conceptualizations have been shown to be equivalent (Leung-Yan-Cheong and Cover, 1978) and aim at
“the development of the concepts of information and randomness by means of the theory of algorithms”
(Zvonkin and Levin, 1970, p. 83).
Kolmogorov complexity: description length
The idea behind Kolmogorov’s measure is that if we have information about something, it means that
we can describe it. Using his own words: “In practice, we are most frequently interested in the quantity
of information ‘conveyed by an individual object x about an individual object y’” (1968; p. 163). To do so,
one could define the quantity of knowledge about an object “in terms of the number of bits required to
losslessly describe it. A description of an object is evidently useful in this sense only if we can reconstruct
the full object from this description” (Li and Vitanyi, 2008; p. 101). A more objective way of describing
something is to create an algorithm that contains all required information. When the algorithm can
describe everything about the object or process, we can say that it knows everything about it that we gave
it to know. For example, an algorithm that describes an object that is about to be 3D printed contains all
the information about this object. In essence, it contains it to the level of detail that will later on be
included in the 3D printed result. If we want more detail, we need a larger algorithm that contains more
information. An algorithm that describes how to play a perfect game of checkers (Schaeffer et al., 2007)
contains all information about the procedure of flawless checkers. It is therefore the algorithm needed to
describe something in all of its detail. Algorithms consist of symbols. So a very natural way of quantifying
the content of the algorithm is to ask about the minimum number of symbols needed to efficiently
describe an object or process to a specific level of detail. This in essence is Kolmogorov’s measure.
Kolmogorov’s measure is low if the object is easy to describe with few symbols, and high if it requires a
lengthy description.
Key takeaway: Kolmogorov complexity measures the minimum number of symbols needed
to describe an object or process.
For example, the list below shows five processes. Each is a fluctuating process that goes up
and/or down over 20 periods. It might represent the stock market, or the popularity of a social media
post, or student grades over the year, etc. We want to compare their Kolmogorov complexity in terms of
the length of the algorithm required to describe each of them. Which of the following strings of 20 symbols
is more or less difficult to describe?
Period:  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
(a)      ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑
(b)      ↓  ↑  ↓  ↑  ↓  ↑  ↓  ↑  ↓  ↑  ↓  ↑  ↓  ↑  ↓  ↑  ↓  ↑  ↓  ↑
(c)      ↓  ↓  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↓  ↓  ↓
(d)      ↓  ↓  ↑  ↓  ↓  ↑  ↓  ↓  ↓  ↓  ↑  ↑  ↑  ↑  ↑  ↑  ↓  ↑  ↑  ↓
(e)      ↑  ↑  ↓  ↑  ↓  ↑  ↓  ↑  ↓  ↑  ↑  ↓  ↑  ↑  ↓  ↓  ↑  ↓  ↓  ↓
Strings (a) and (b) are quite easy to describe. This is because it is straightforward to detect a pattern or
regularity in them. This clear pattern allows us to use a short recipe or algorithm to describe them. For
example, using pseudo-code, we could say:
(a) print: ↑ 20 times, stop.
(b) print: [↓ ↑] 10 times, stop.
We can also come up with a reasonably short description of string (c):
(c) print: ↓ ↓, ↑ 15 times, ↓ ↓ ↓, stop.
This shows that whenever we have a string of something similar, we can compress this description by
stating more of the same. Much in line with what information theory suggests when compressing data
into entropic information, what we do here is taking advantage of redundancy. If something is redundant,
it does not provide new information and we can shortcut its description. Now, which of the above
descriptions is shortest and which longest? The pseudo-code exercise seems to suggest that the
Kolmogorov complexity of string (a) is less than that of (b), which is less than that of (c). Let us be a bit more
systematic about this evaluation by creating a code that is a bit more formal and then comparing their lengths.
As with information theory, also in computer science we can in principle use any kind of symbols to
formulate our algorithm, such as the Roman alphabet with symbols like a, b, c; or Greek symbols like β
and μ; or Chinese symbols, cartoon images that characterize an alphabet of cognitive concepts, etc. And
again, the most basic code is binary, which is the reason why digital computers use the binary code on the
lowest level to construct all these higher-level symbols. In the electronic circuit of a silicon chip, this is represented by
“existence of electrical current” (e.g. (1)), and “no current” (0). So let us code an upward trend ↑ with (1)
and a downward trend ↓ with (0).
We require two different groups of code words: one that distinguishes among the up- and downward
tendency, and another one that specifies the length of any eventual repetition and redundancy of symbols.
In order to keep them apart, we have to mark each of them differently. Using a binary code, we can start
all the specifications of the tendency with (1) and all the specifications of the length of an eventual
sequence with (0). The Figure shows the resulting logic.2 This code allows us to compress every sequence
of symbols that repeats more than 4 times by simply stating the number of repetitions. Using this code,
we could encode strings (a), (b) and (c) with the following prefix-free code:
Pseudo (a): ↑ 20 times
Code (a):   11 00000                                Number of bits: 7

Pseudo (b): ↓ ↑ 10 times
Code (b):   10 11 01010                             Number of bits: 9

Pseudo (c): ↓ ↓ ↑ 15 times ↓ ↓ ↓
Code (c):   10 10 11 00101 10 10 10                 Number of bits: 17
Using our code, we need 7 bits for string (a), 9 for string (b), and 17 for string (c). This
confirms our previous intuition: (a = 7 bits) < (b = 9 bits) < (c = 17 bits). We can now also find a code for
strings (d) and (e). Based on the coding scheme in the Figure, we will compress every sequence of symbols
that repeats more than 4 times:
Pseudo (d): ↓ ↓ ↑ ↓ ↓ ↑ ↓ ↓ ↓ ↓ ↑ 6 times ↓ ↑ ↑ ↓
Code (d):   10 10 11 10 10 11 10 10 10 10 11 01110 10 11 11 10                  Number of bits: 35

Pseudo (e): ↑ ↑ ↓ ↑ ↓ ↑ ↓ ↑ ↓ ↑ ↑ ↓ ↑ ↑ ↓ ↓ ↑ ↓ ↓ ↓
Code (e):   11 11 10 11 10 11 10 11 10 11 11 10 11 11 10 10 11 10 10 10         Number of bits: 40
2
The code is not the most efficient code, but it satisfies two main ideas: it is prefix-free, which means that it can be decoded as
soon as it is received (satisfying the Kraft inequality: every branch is cut off after it is used as code; every used branch is final),
and it allows us to compress strings with many (up to 16 consecutive) repetitions.
For string (d) we require 35 bits and for string (e) 40 bits. This is quite inefficient and does not even
justify the effort of setting up our code: we might as well have simply encoded upward with (1)
and downward with (0) and then written both strings out as they are in 20 bits. This would have been more
efficient for these two cases. Especially for string (e) we do not find any opportunity for compression. No
pattern was found that could have enabled us to efficiently compress the string. Kolmogorov said that
such strings are truly random, or, in von Mises’ terms, “regellos” (‘ruleless’ or irregular). Those strings
have maximum Kolmogorov complexity. In the cases where we were able to detect some pattern, we were able
to compress the string, which we did by compressing equivalent consecutive symbols (or groups of
symbols). In the sense of a dynamical system, these recurring patterns could for example be periodic
orbits, such as string (b), which would represent the recurring pattern of day and night. Also pattern (c)
seems unlikely to be completely random. It seems like there is some kind of generative mechanism
that produces this long sustained upward stretch. Something seems to produce this pattern. Kolmogorov
complexity aims at quantifying how much structure there is. Even though our code is not the most efficient
code2 (it was mainly chosen for pedagogic reasons), it still manages to compress some strings to a length
smaller than this brute-force method. The smallest description length of our binary strings of
length 20 turns out to be 7 bits.
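A simplified sketch of this run-length idea is given below. It encodes ↑ as 11 and ↓ as 10 and replaces any run of more than four identical symbols by the symbol's codeword plus a length field that starts with (0) and uses 4 bits. The exact 4-bit convention of the author's Figure is not reproduced (the run-length-minus-5 mapping is an assumption), and, unlike the original scheme, repeated multi-symbol groups are not compressed, which is why string (b) does not shrink to 9 bits here; the bit counts for (a), (c), (d) and (e) nevertheless match the tables above:

SYMBOL_CODE = {"u": "11", "d": "10"}   # u stands for ↑, d stands for ↓

def encode(string):
    out, i = "", 0
    while i < len(string):
        run = 1
        while i + run < len(string) and string[i + run] == string[i]:
            run += 1
        if run > 4:
            # long run: symbol codeword + length field (0 + 4 bits; run - 5 is an assumed convention)
            out += SYMBOL_CODE[string[i]] + "0" + format(run - 5, "04b")
        else:
            # short run: write every symbol literally
            out += SYMBOL_CODE[string[i]] * run
        i += run
    return out

strings = {
    "a": "u" * 20,
    "c": "dd" + "u" * 15 + "ddd",
    "d": "dduddudddd" + "u" * 6 + "duud",
    "e": "uududududuuduudduddd",
}
for name, s in strings.items():
    print(name, len(encode(s)))   # a: 7, c: 17, d: 35, e: 40 bits, matching the tables above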
Key takeaway: Kolmogorov complexity is largest when there is no structure or pattern, but
complete randomness; and lower when there is much regularity.
Note that Kolmogorov’s metric does not only consider the average appearance of differences (how
many ups and downs?), but also their sequence. Strings (b), (d) and (e) all have ten ups ↑ and ten downs
↓. In Shannon’s sense of drawing from a random urn (see Figure above), they would provide us with the
same average entropy.3 However, their sequence in time means that we can compress some
more than others. They require programs of different lengths to describe them.
You might have wondered if our logic depends on this quickly drawn-up code from the Figure above.2
The answer is ‘not really’. It does depend on the code, but the only difference between this and the
optimal code is a constant. Since this constant is equal for the codification of any structure or pattern, it
does not matter for our comparative purposes. The constant basically consists of a translation program
that translates from our chosen code (inefficient as it may be) to the optimal code conceivable. For
example, in the translation from our pseudo-code to our code we neglected adding the term ‘stop’ at the
end, as it would be added to each one of the strings. For very long patterns or intricate structures this
overhead becomes such a small percentage of the total code length that it can be neglected for all
practical purposes (as long as the code is not exaggeratedly inefficient). This is known as the invariance
theorem: given any description language, the optimal description language is at least as efficient, up to
some constant overhead.
3
The technical reason stems from the asymptotic equipartition property, which holds that they are all part of the same typical
set.
Knowledge and algorithms
So is it fair to equate algorithms with knowledge? Is knowledge not something inherently human?
Imagine you visit your friend’s hometown for the first time and drive around. You have to go to the airport
and luckily your friend ‘knows’ the town and all its quirks like the back of her hand. She knows all shortcuts,
even considers what time traffic creeps up at which location, and her in-depth knowledge allows her to
quickly come up with alternative routes on the go. Nobody would doubt saying that your friend ‘knows’
the city well. Now replace your friend with your GPS navigation system. Does the GPS not ‘know’?
Artificial intelligence (A.I.) and robotics have long challenged the assumption that technology does
not contain knowledge (Wiener, 1948; Minsky, 1986; Russell and Norvig, 2009; Siegwart et al., 2011).
Within A.I., machine learning has emerged as the method of choice for developing practical applications.
Machine learning essentially imitates the way humans acquire knowledge. We do not teach our toddlers
the difference between a car and a motorcycle by giving them a detailed list of definitions about how many
wheels and doors are to be expected. We simply point several examples out to them. This might lead to
conflict when seeing a trike motorcycle for the first time, but it allows for the flexibility to create new
categories. Software developers now recognize that it can be far easier to train
systems through exposure than to program them manually (Jordan and Mitchell, 2015). Autonomous vehicles
on Mars, credit card fraud detection systems, and autonomously controlled Metro systems are powered by
biologically inspired solutions like genetic algorithms (Mitchell, 1998), artificial neural networks
(Yegnanarayana, 2004), and more recently ‘deep learning’ (Hinton et al., 2012). For example, deep
learning consists of evolving multilayer networks that can employ billions of parameters to learn how to
make sense of the world (Schmidhuber, 2015). The outcome of learning from unlabelled audio-visual data
(Hinton and Salakhutdinov, 2006) is strikingly similar to outcomes of neural networks employed by
biological agents. Artificial agents bestowed only with rudimentary sensory recognition (of pixels) and a
reward signal (the goal to increase a score) can evolve an intelligence that outperforms every human
expert in a matter of hours (Mnih et al., 2015).
Humankind has started to rely heavily on the knowledge of such artificial intelligence. For example,
we trust A.I. algorithms to know which kinds of news we should be exposed to and have come to rely
almost exclusively on this filter when updating ourselves about our friends in online social networks
(Bakshy et al., 2015), and with one in three marriages in America beginning online (Cacioppo et al., 2013),
algorithms have also started to take an undeniable role in sexual mating and humanity’s genetic
inheritance. We allow them to be the judge of the prices at which we should have access to goods and
services (Hannak et al., 2014), and we trust that automated trading algorithms have the knowledge to
execute three out of four transactions on the largest resource exchange of homo sapiens (U.S. stock
markets) (Hendershott et al., 2011). We trust that intelligent algorithms can take better care of our health
and security than humans could, because they have more knowledge, such as when we use them on a daily
basis in anti-lock braking in cars and autopilots in planes; by letting them take care of the safety management
of the entire engineering staff of a Metro system (Chun & Suen, 2014; Hodson, 2014); or by recognizing that no
human team, however sophisticated, would have the knowledge to manage humanity’s main energy
source (the electric grid) as well as artificial intelligence can (Ramchurn et al., 2012). Artificial intelligence
can by now classify human personality better than the same individuals can self-identify
their psychological character traits (Wu et al., 2015), and is learning the skills to take over white-collar
employments whose prerequisite knowledge requires of a human an extensive university
career, such as pharmacist robots that select, label and dispense medicine based on a doctor’s prescription,
while checking for millions of possible side-effects against the patient’s individual health record in real time
(UCSF, 2011), or the provision of fully automated cognitive behavioral therapy for the case that the
biological psyche of a human has started to endanger their well-being (Bohannon, 2015). It is difficult, if
not impossible, to argue that knowledge is in the sole domain of human agents.
This being said, there are several differences and similarities in how humans and machines handle
knowledge. One omnipresent characteristic of human knowledge is the so-called tacitness of many aspects of
knowledge. Economists are quick to point to the fact that not all the details of all knowledge structures
are understood, articulable, or readily tractable. An agent might not be aware or might not be able to
articulate the details of a procedure, which is then referred to as “tacit knowledge” (Polanyi, 1966). It is
important to recognize that this does not change the fact that some kind of neural, social, or natural
pattern underpins the executed procedure. At the end, some kind of result is repeatedly achieved through
a certain (hidden) process. Information theorists and computer scientists speak about ‘hidden states’ and
frequently model them with ‘hidden Markov models’, which imply that there are some states and
dynamics that are not captured by the model. These are then modelled with probabilistic uncertainty (if
they were known, their deterministic algorithm would be revealed).
A related characteristic is that knowledge does not always have to be optimized, but can also just be
a rough, approximate description of something. Behavioral economists often point out that some
procedures are not optimal solutions for a specific task. These are not referred to as pure knowledge, but
merely approximate heuristics (Tversky and Kahneman, 1974). This also does not change the fact that
some pattern does exist. Most practical computer algorithms work with shortcutting heuristics that
provide an acceptable solution in return for speed and lower computational intensity. Much in line with how
human decision makers often weigh the time to make a decision against the quality of the decision
(Kahneman, 2013), the goal is not to be perfect, but to be fit enough for our messy environment. The
artificial intelligence community has long embraced this insight, for example through so-called “probably
approximately correct (PAC) learning” (Valiant, 1984).
Another characteristic refers to the origin of new knowledge. An important body of social science
literature asks where new, innovative insights come from. The general consensus is that new knowledge
emerges from the recombination of existing knowledge (e.g. Poincare, 1908; Schumpeter, 1939;
Weitzman, 1998; Antonelli et al., 2010; Tria et al., 2014; Youn et al., 2015), be it by inductive observation
or deductive inference. This notion is certainly very familiar to everybody who has written ‘new
algorithms’ by ‘copy-pasting’ existing modules, subroutines, or callable functions. Knowledge in both
biological and artificial systems grows by the recombination of previous knowledge.
Key takeaway: From a theoretical perspective, it is unjustified to argue that human and
artificial knowledge are inherently different, which does not change the fact that both can
have similar or different characteristics.
This all being said, it is important to notice that Kolmogorov’s descriptive complexity of an algorithm
in terms of its length is a very broad and generalized notion. It is not even limited to structures and
patterns of humans or machines. In essence, it can be applied to any kind of structure and pattern,
including those from nature. Rain and sunshine provide patterns, as do earthquakes, and the structure of
a rock. This means, we can measure the information content of a rock by measuring its descriptive
complexity. What is the shortest program required to describe this specific rock? What is its Kolmogorov
complexity? How much knowledge do we require to recreate this rock exactly as is (think about a 3D
printer)? The use of tools from information theory and computer science is actually one of the most
promising current approaches to study complex systems, including crystals and molecules (e.g. Varn &
Crutchfield, 2015a; 2015b).
An ‘amazing’ and ‘beautiful’ fact
Hardcore engineers are a serious crowd of people, especially theoretical ones. You do not have a lot
of room for flamboyant excitement when you design critical bridges, airplanes, or the world’s life-sustaining
communication systems. Any exaggeration will quickly be exposed by catastrophic failure with
real-world consequences. As a result, they tend to be exceptionally conservative in their judgement. For
example, when they use the term “almost certain”, they usually do not refer to a 1:10 chance like you and
me would, but rather use it to refer to a chance of 1 in a number that is bigger than the number of
seconds since the Big Bang. When they refer to “large numbers” they usually mean infinity, but do not
want to say it, because they cannot guarantee that it is always truly infinite. Nowhere else but in the
following have I personally ever seen that they lose their chronic cool and excitedly exclaim that
something is “amazing” (Cover and Thomas, 2006; p. 463) and “beautiful” (Li and Vitanyi, 1997; p. 187).
What elicits this excitement is the fact that Shannon’s probabilistic and Kolmogorov’s deterministic
approaches are equivalent on average. They turn out to be two sides of the same coin. In the words of
Cover and Thomas (2006), who wrote the standard text book on information theory: “It is an amazing fact
that the expected length of the shortest binary computer description of a random variable is
approximately equal to its entropy” (p. 463), and in the words of Li and Vitanyi (1997), who wrote the
standard textbook on Kolmogorov complexity: “It is a beautiful fact that these two notions turn out to be
much the same” (p. 187).
Key takeaway: On average over long descriptions (asymptotically), Shannon probabilistic
information is equivalent to the amount of knowledge required by Kolmogorov’s approach
to describe the object.
While the formal mathematical proof of this is quite subtle, the underlying reasoning is actually quite
intuitive. Shannon’s probabilistic approach is like answering binary questions, for example about the
location of a destination on a map (‘Is it North or South?’; ‘Is it East or West?’, ‘Is it East or West at the
crossing?’). This is in agreement with the epistemological idea that “information is contained in answers
to questions” (Bernstein, 2011; also Ackoff, 1989). More precisely, one bit reduces the uncertainty of the
answer to a question by half. Kolmogorov’s algorithmic approach is like describing the route to a
destination on a map (e.g. ‘Go North, then West, if at crossing, then turn East, etc.’). The equivalence says
that both ways of defining the final destination require on average the same number of symbols. To
understand why, it is important to remember what it means to describe something. First and foremost,
describing ‘some thing’ requires distinguishing every relevant aspect of the object from all other
alternative ‘things’. In other words, our description contrasts this object with all other possible alternatives
of this object by precluding what the object is not. In Shannon’s sense, the city of Davis in Northern
California can be identified from within a probability space that selects it from all other
cities (most fundamentally by a series of binary choices that reduce uncertainty by half). In Kolmogorov’s
sense, it can be identified by giving deterministic instructions of how to bypass all other cities to end up
in Davis (most efficiently by giving maximal explanatory power with every instructional symbol). On
average, they should turn out to result in the same amount of information or instructions. Otherwise, one
would be superior to the other, which would motivate us to mainly use one and not the other approach.
But we use both in different cases.
Key takeaway: Both Kolmogorov complexity and Shannon’s information are linked through
being the counterpart of randomness and uncertainty:
no knowledge ≈ no pattern ≈ randomness ≈ uncertainty ≈ no information
The formal proof that Shannon’s probabilistic and Kolmogorov’s algorithmic definitions of information
are equivalent is quite subtle and goes back to Kolmogorov himself (1968), being fleshed out by Zvonkin
and Levin (1970). It has been popularized in the information theory community by Leung-Yan-Cheong and
Cover (1978) (see also the standard textbook on information theory, Cover and Thomas, 2006, Ch. 14.3)
and among computer scientists and physicists by a series of papers that eventually exorcised Maxwell’s
(1871) 120-year-old demon at the hands of Szilard (1929), Bennett (1982) and especially Zurek (1989) (see
also Caves, 1990; Zurek, 1998, and for an introduction see Baeyer, 1999).
Proof outline of the information-knowledge equivalence
The equivalence theorem says that both metrics are equivalent on average, over long strings, or, to
use a technical term, asymptotically, that is arbitrarily close as the limit is taken. Shannon’s metric
naturally refers to the average case (being based on probabilities that arise from many different cases).
To obtain an average version of Kolmogorov’s specific approach we could list them all. So for example, if
we have a certain dynamic (for example rain or sunshine over three weeks), we could list all possible
strings over the involved 20 days (much in line with the example above) and compress them all. This would
give us the average length of the string, according to our code. There are 2^20 = 1,048,576 possible strings
of 20 binary symbols. This would fill up around 125 books of 250 pages each, with one string listed in each
line. We would then compare the average code length after compression to the average amount of
uncertainty that is reduced when Shannon’s information provides answers to questions about the same
pattern over 20 periods.
We will follow the same idea, just on a smaller scale and without using a black-box compressor, so
we can track what is going on (the outline of the proof is based on Caves, 1990). First, we specify a program
that is able to produce all possible strings with a given length. For example, let us work with a binary string
of length 4 (keeping it simpler than length 20). We write a program that lists all possible combinations of
four binary symbols4. This gives us the 16 different strings presented in the second column in the Table.
We call this program that produces this table 𝐾(𝑡), simply to give it a name (think Kolmogorov complexity
of the table). In order to produce our program 𝐾(𝑡) we need to specify each of the 16 strings, which
requires log2 16 = 4 𝑏𝑖𝑡𝑠. Additionally we also need to specify how long the string is. This is important
because it tells the program when to stop or ‘halt’ before some other string might start.5 The length of
4 For example, according to the binomial coefficient function C(4, k), “4 choose k”, with k being the number of (1)s.
5 While this might seem like a trivial add-on, the problem of ‘halting’ is at the heart of theoretical computer science. It goes back to
the fundamental Entscheidungsproblem (decision problem) of David Hilbert. In 1900 he asked if general algorithms exist to
(roughly speaking) solve mathematics. The answer was negative and was provided by the logician Kurt Gödel in the 1930s and by
two of the most influential founders of theoretical computer science, Alan Turing and Alonzo Church.
our string is 4 in this case, so to declare that the string is 4 symbols long, we require additional log2 4 =
2 𝑏𝑖𝑡𝑠 to define this length. Therefore, the program 𝐾(𝑡) has 4 bits + 2 bits = 6 bits. After writing the table
with all possibilities we assign a number i to each string (first column in the Table). If we want to identify
a particular string, we simply need to provide the corresponding code number i. For example, i = 6
identifies code 1100. We call this program that specifies the string 𝐾(𝑠), simply to give it a name (think
Kolmogorov complexity of the string). How many symbols do we need to encode 𝐾(𝑠)?
We again follow Shannon’s logic. If all strings are equally likely, picking one string takes one out of
16 equally likely choices, and therefore as well log2 16 = 4 𝑏𝑖𝑡𝑠. 𝐾(𝑠) has therefore 4 bits. If the strings
are not equally likely, we can adjust for that with Shannon’s formula and compress this to less than 4 bits
(following Shannon’s source coding theorem). Now we are in a position to clearly identify our string with
two consecutive steps: first we create the entire table of all possible strings with program 𝐾(𝑡), and second
we identify the specific string that we are interested in with program 𝐾(𝑠). In our case this requires
𝐾(𝑡) + 𝐾(𝑠) = 6 𝑏𝑖𝑡𝑠 + 4 𝑏𝑖𝑡𝑠 = 10 𝑏𝑖𝑡𝑠. It might be inefficient to describe a string of four binary symbols
with 10 bits, but it will become beneficial once the strings get longer. For now, this example shows a valid
two-step procedure to specify any specific string of arbitrary length from scratch: we first define the entire
set of possible strings with a certain length, and then specify which string of this set we are interested in.

i    Different strings    k: # of (1)s
1    0000                 0
2    1000                 1
3    0100                 1
4    0010                 1
5    0001                 1
6    1100                 2
7    1010                 2
8    1001                 2
9    0110                 2
10   0101                 2
11   0011                 2
12   0111                 3
13   1011                 3
14   1101                 3
15   1110                 3
16   1111                 4

Now, 𝐾(𝑡) and 𝐾(𝑠) might have some things in common. For example, 𝐾(𝑡) includes information about
how long the string is, and 𝐾(𝑠) lists a string with a given length. We might be able to save something here
by compression of redundancy among both programs. So when we construct the program 𝐾(𝑠) we make
sure that we use all the code that is already included in the table t produced by program 𝐾(𝑡), making use
of what they have in common. The result can be shorter than the previous two separate programs (if there
is synergy):
𝐾(𝑡) + 𝐾(𝑠) ≥ 𝐾(𝑡) + (𝑝𝑟𝑜𝑔𝑟𝑎𝑚 (𝑠), 𝑔𝑖𝑣𝑒𝑛 𝑤ℎ𝑎𝑡 𝑖𝑠 𝑎𝑙𝑟𝑒𝑎𝑑𝑦 𝑖𝑛 𝑡) = 𝐾(𝑡) + 𝐾(𝑠|𝑡)
This gives us an upper limit on Kolmogorov’s algorithmic measure that does not involve
probabilities: we never need more than the number of bits to list the entire table plus the number of bits
to specify the string of interest, and we can take out what they have in common. Up to here there are no
probabilities involved.
The program 𝐾(𝑡) that creates the table appears on both sides of the preceding inequality and can
be cancelled. This means that we now take the table as given. This results in the intuitive fact that 𝐾(𝑠)
is at least as large as the specification of the number when knowing the lookup table:
𝐾(𝑠) ≥ 𝐾(𝑠|𝑡)
This applies to the specification of one specific string. Let us ask about the average string: how big
is the program for the average string? This depends on how often each string appears. On the one
extreme, we have 16 equally likely strings, and on the other extreme we get that only one string occurs
while 15 never occur. The following table shows these two cases and an intermediate case where (0)s
have a 25 % chance of showing up and (1)s a 75 % chance (which affects the probability of the different
strings, which are combinations of (0)s and (1)s). Weighting the program size by these weights we get:
∑_{s} 𝑝 ∗ 𝐾(𝑠) ≥ ∑_{s} 𝑝 ∗ 𝐾(𝑠|𝑡)
How many bits do we need to represent the average program 𝐾(𝑠) most effectively? This also
depends on the probabilities of the different strings. If we do not have the table and do not know the
probabilities of the 16 strings, we have to go with Laplace’s Principle of Indifference and Jaynes’ MaxEnt
principle and assume that they are uniformly distributed. This would require log2 16 = 4 𝑏𝑖𝑡𝑠. On the
other extreme, if only one string is valid, we require log2 1 = 0 𝑏𝑖𝑡𝑠. For the in-between case from the
table we require ≈ 3.245 𝑏𝑖𝑡𝑠. So knowing the table, we can use Shannon’s source coding theorem and
see if we can compress 𝐾(𝑠) down to its entropy. The entropy without having the table that reveals the
distribution, 𝐻(𝑠), is larger than the average uncertainty given the table, 𝐻(𝑠|𝑡):
𝐸[𝐾(𝑠)] = 𝐻(𝑠) ≥ 𝐻(𝑠|𝑡) ≥ 𝐸[𝐾(𝑠|𝑡)]
i    String   k: # of (1)s   Probabilities p(0)=0.5; p(1)=0.5   Probabilities p(0)=0.25; p(1)=0.75   Probabilities p(0)=0; p(1)=1
1    0000     0              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¼ ∗ ¼ ∗ ¼ ∗ ¼ ≈ 0.004                0 ∗ 0 ∗ 0 ∗ 0 = 0
2    1000     1              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¾ ∗ ¼ ∗ ¼ ∗ ¼ ≈ 0.012                1 ∗ 0 ∗ 0 ∗ 0 = 0
3    0100     1              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¼ ∗ ¾ ∗ ¼ ∗ ¼ ≈ 0.012                0 ∗ 1 ∗ 0 ∗ 0 = 0
4    0010     1              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¼ ∗ ¼ ∗ ¾ ∗ ¼ ≈ 0.012                0 ∗ 0 ∗ 1 ∗ 0 = 0
5    0001     1              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¼ ∗ ¼ ∗ ¼ ∗ ¾ ≈ 0.012                0 ∗ 0 ∗ 0 ∗ 1 = 0
6    1100     2              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¾ ∗ ¾ ∗ ¼ ∗ ¼ ≈ 0.035                1 ∗ 1 ∗ 0 ∗ 0 = 0
7    1010     2              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¾ ∗ ¼ ∗ ¾ ∗ ¼ ≈ 0.035                1 ∗ 0 ∗ 1 ∗ 0 = 0
8    1001     2              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¾ ∗ ¼ ∗ ¼ ∗ ¾ ≈ 0.035                1 ∗ 0 ∗ 0 ∗ 1 = 0
9    0110     2              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¼ ∗ ¾ ∗ ¾ ∗ ¼ ≈ 0.035                0 ∗ 1 ∗ 1 ∗ 0 = 0
10   0101     2              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¼ ∗ ¾ ∗ ¼ ∗ ¾ ≈ 0.035                0 ∗ 1 ∗ 0 ∗ 1 = 0
11   0011     2              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¼ ∗ ¼ ∗ ¾ ∗ ¾ ≈ 0.035                0 ∗ 0 ∗ 1 ∗ 1 = 0
12   0111     3              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¼ ∗ ¾ ∗ ¾ ∗ ¾ ≈ 0.105                0 ∗ 1 ∗ 1 ∗ 1 = 0
13   1011     3              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¾ ∗ ¼ ∗ ¾ ∗ ¾ ≈ 0.105                1 ∗ 0 ∗ 1 ∗ 1 = 0
14   1101     3              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¾ ∗ ¾ ∗ ¼ ∗ ¾ ≈ 0.105                1 ∗ 1 ∗ 0 ∗ 1 = 0
15   1110     3              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¾ ∗ ¾ ∗ ¾ ∗ ¼ ≈ 0.105                1 ∗ 1 ∗ 1 ∗ 0 = 0
16   1111     4              ½ ∗ ½ ∗ ½ ∗ ½ = 1/16               ¾ ∗ ¾ ∗ ¾ ∗ ¾ ≈ 0.316                1 ∗ 1 ∗ 1 ∗ 1 = 1

Bits required for 𝐾(𝑠) = 𝐻(𝑠):                                  4 𝑏𝑖𝑡𝑠                               ≈ 3.245 𝑏𝑖𝑡𝑠                    0 𝑏𝑖𝑡𝑠
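The entropies in the bottom row of the table can be verified with a short sketch that enumerates all 16 strings and computes H(s) under each of the three bit-level probability assumptions (the function names are arbitrary):

from itertools import product
from math import log2

def string_probability(bits, p1):
    # probability of one specific 4-bit string when each bit is (1) with probability p1
    prob = 1.0
    for b in bits:
        prob *= p1 if b == 1 else (1 - p1)
    return prob

def entropy_of_strings(p1):
    # entropy in bits of the distribution over all 2^4 = 16 strings
    probs = [string_probability(bits, p1) for bits in product([0, 1], repeat=4)]
    return -sum(p * log2(p) for p in probs if p > 0)

for p1 in (0.5, 0.75, 1.0):
    print(p1, round(entropy_of_strings(p1), 3))   # 4.0 bits, 3.245 bits, 0.0 bits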
Having an upper limit, we now ask about a lower average limit. Well, cutting a long story short,
Shannon’s source coding theorem tells us that we cannot compress the specification of the string given
the table, 𝐾(𝑠|𝑡), below its entropy. This defines a lower limit at:
𝐸[𝐾(𝑠|𝑡)] ≥ 𝐻(𝑠|𝑡)
Combining both inequalities, the one for the upper- and the one for the lower limit, gives us:
𝐻(𝑠|𝑡) ≥ 𝐸[𝐾(𝑠|𝑡)] ≥ 𝐻(𝑠|𝑡)
In words, the number of bits that is required to identify an object through description on average,
𝐸[𝐾(𝑠|𝑡)], is no smaller than the entropy of the string given the table and no larger than the entropy of the
string given the table. In mathematical terms we say that this double inequality ‘is tight’ (the two sides
approach a double equality).
This reveals a subtle, and often neglected, but very important point. In order to apply Shannon’s
compression logic we require a lookup table like the one we have here, or some kind of equivalent that
provides us with the probability distribution of the source. Only then can we use our ingenious coding
schemes for compression. Both Shannon’s theoretical entropy formula and Huffman’s practical coding
application require the input of the probability distribution. Both the conditional Kolmogorov complexity
and Shannon entropy are conditioned: both are conditioned on the fact that we know the underlying
probability distribution specified by the table. Returning to Shannon’s own analogy of the twins, the
“twins” (MPEG, JPEG, ZIP, AVI, etc.) are aware of the distribution and incorporate it in order to compress
the message. Entropy expresses the remaining uncertainty, knowing the table. Describing the ‘table’ that
specifies the involved probabilities requires a program, which we called 𝐾(𝑡). This consists of algorithms
in Kolmogorov’s sense. In Kolmogorov’s case, we have been very explicit and derived the conditional
length of the program given the table. Once we account for the fact that both are conditional on ‘the
table’, it turns out that both approach each other.
Key takeaway: Since Shannon’s measure of information considers what is already known
about the underlying probability distribution, it requires the algorithmic description of the
underlying probability distribution.
Some after-thoughts: the mixed use of information and knowledge…
Being two sides of the same coin, both quantities also complement each other. For example, when
studying complex systems, part of the system’s dynamic can be described in terms of deterministic
structure and the other part in terms of probabilistic uncertainty. This duality is succinctly expressed in
familiar concepts of modern complexity science, such as ‘algorithmic randomness’ (Zurek, 1989),
‘statistical mechanics’ and ‘deterministic chaos’ (Crutchfield, 2012). These terms imply that some part of
a system’s dynamic is algorithmic/mechanical/deterministic, while at the same time being partially
random/statistical/chaotic. One part refers to the part that is known about the system (deterministic,
without doubt), the other one to the part that is unknown (probabilistic, in the worst case maximum
entropy/uncertainty) (Shalizi and Crutchfield, 2001; Crutchfield and Feldman, 2003).
The same logic is at the heart of the argument that finally exorcised Maxwell’s (1872) notorious
demon by the hands of Szilard (1929), Bennett (1982) and Zurek (1989). Zurek (1989) showed that the
demon can either ‘know’ the position of a particle to extract energy (through an algorithm), or it can make
an uncertainty reducing observation (through informational bits). Both have the same net effect.
Also digital technologies make complementary use of both. For example, in order for the sending and
receiving agent to ‘know’ that an incoming symbol has a 50%-50% chance (or a 20%-80% chance, etc.), it
requires a ‘lookup table’ with the corresponding probabilities. At the very least, the definition of
uncertainty always requires a normalized probability space (normalizing probabilities between 0 and 1).
In other words, every probability is conditioned on its underlying probability space (defining the number
of possible events and their probabilities). In modern digital applications these encoding and decoding
‘lookup tables’ are an integral part of technological standards like ZIP, MPEG, JPEG, MP3, CDMA, or UMTS,
ATSC, etc. Only through this ‘lookup table’ can the sender distinguish plain data from uncertainty-reducing
information and choose to only send information (through compression), while neglecting redundant
data. The existence of the ‘lookup table’ requires code in the form of an algorithm. This algorithm might create
a dynamic lookup table (such as with Lempel-Ziv compression), but it is nonetheless an algorithm that
can be quantified in terms of Kolmogorov complexity (see Caves, 1990). For example, first you have to
know that there are 4 events that are equally likely (this is the given ‘lookup table’). Then two consecutive
bits allow us to identify the chosen one of the 4 choices.
The complementary nature of both is also an integral part of most existing social science definitions
of information and knowledge. It also provides a clear theoretical justification for how to distinguish
between both, a distinction that causes much confusion in the social science literature (e.g. Cowan et al., 2000). For
example, economists state that “it is difficult to argue that explicit knowledge, recorded in documents and
information systems, is any more or less than information” (Rowley, 2007; p. 175). Our formal definitions
show how to distinguish among them: one is probabilistic, the other one deterministic.
Saviotti (1998) explains that “particular pieces of information can be understood only in the context
of a given type of knowledge” (p. 845); Jensen et al. (2007) point out that “in order to understand
messages about the world you need to have some prior knowledge about it” (p. 681); and Cowan et al.
(2000) explain that “it is the cognitive context afforded by the receiver that imparts meaning(s) to the
information-message… The term ‘knowledge’ is simply the label affixed to the state of the agent’s entire
cognitive context” (p.216). While these notions invoke sophisticated analogies, in the simplest case, this
‘cognitive context’ is a simple ‘lookup table’ (a dictionary or codebook) that allows us to distinguish between
data and uncertainty-reducing information and to de-codify the incoming symbols. “Thus, initial
codification activity involves creating the specialized dictionary” (Cowan et al., 2000; p.225).
Their complementarity also has economic relevance. Algorithms allow an agent to infer information that is part of the algorithmic structure without the need for further measurement (observation). This is advantageous for predicting the future (where empirical observations are impossible) or “if calculation costs are lower than measurement costs” (Saviotti, 2004, p. 104). In this sense, an agent can choose to obtain the information ‘bit by bit’ through observation (e.g. when it is enough to react to the clouds in the sky) or invest in creating an algorithm that allows for predicting the clouds in the sky. While on average the same amount of information and knowledge is required, one option can be more valuable than the other in particular cases...
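As a hedged illustration of this trade-off (the weather rule, the number of days, and all names below are hypothetical assumptions, not part of the original argument), the following Python sketch contrasts recording every observation with running a short algorithm that generates the same sequence:

```python
# Illustrative sketch only (the weather rule, the number of days, and all names
# are hypothetical): 'measurement' records every observation, while 'calculation'
# runs a short algorithm that reproduces the same sequence without observing it.

N = 30  # hypothetical number of days

# Measurement: one recorded symbol per day.
observed = ['cloudy' if day % 3 == 0 else 'clear' for day in range(N)]

def predict(day):
    """Calculation: a one-line rule stands in for day-by-day observation."""
    return 'cloudy' if day % 3 == 0 else 'clear'

# On this perfectly regular pattern, the short algorithm carries the same
# information as the full record of observations.
assert all(predict(day) == observed[day] for day in range(N))
```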
References
Ackoff, R. L. (1989). From Data to Wisdom. Journal of Applied Systems Analysis, (16), 3–9.
Antonelli, C., Krafft, J., & Quatraro, F. (2010). Recombinant knowledge and growth: The case of ICTs. Structural Change
and Economic Dynamics, 21(1), 50–69. http://doi.org/10.1016/j.strueco.2009.12.001
Baeyer, H. C. V. (1999). Warmth Disperses and Time Passes: The History of Heat. Modern Library.
Bakshy, E., Messing, S., & Adamic, L. A. (2015). Exposure to ideologically diverse news and opinion on Facebook. Science,
348(6239), 1130–1132. http://doi.org/10.1126/science.aaa1160
Bateson, G. (1972). Steps to an Ecology of Mind. Random House, New York.
Bennett, C. H. (1982). The thermodynamics of computation—a review. International Journal of Theoretical Physics, 21,
905–940.
Bernstein, J. H. (2011). The Data-Information-Knowledge-Wisdom Hierarchy and its Antithesis. NASKO, 2(1), 68–75.
http://doi.org/10.7152/nasko.v2i1.12806
Berrou, C., Glavieux, A., & Thitimajshima, P. (1993). Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1. In Technical Program, Conference Record, IEEE International Conference on Communications, 1993 (ICC ’93), Geneva (Vol. 2, pp. 1064–1070). http://doi.org/10.1109/ICC.1993.397441
Bohannon, J. (2015). The synthetic therapist. Science, 349(6245), 250–251. http://doi.org/10.1126/science.349.6245.250
Boltzmann, L. (1872). Weitere Studien über das Wärmegleichgewicht unter Gasmolekülen. Sitzungsberichte der Akademie der Wissenschaften, 66, 275–370.
Boyd, R., & Richerson, P. J. (2005). The Origin and Evolution of Cultures (First Edition). Oxford University Press, USA.
Brookshear, J. G. (2009). Computer Science: An Overview (10th ed.). Addison Wesley.
Cacioppo, J. T., Cacioppo, S., Gonzaga, G. C., Ogburn, E. L., & VanderWeele, T. J. (2013). Marital satisfaction and break-ups differ across on-line and off-line meeting venues. Proceedings of the National Academy of Sciences, 110(25),
10135–10140. http://doi.org/10.1073/pnas.1222447110
Caves, C. (1990). Entropy and Information: How much information is needed to assign a probability? In W. H. Zurek (Ed.),
Complexity, Entropy and the Physics of Information (pp. 91–115). Oxford: Westview Press.
Chaitin, G. J. (1966). On the Length of Programs for Computing Finite Binary Sequences. Journal of the ACM (JACM), 13,
547–569. http://doi.org/10.1145/321356.321363
Chun, A. H. W., & Suen, T. Y. T. (2014). Engineering Works Scheduling for Hong Kong’s Rail Network. In Twenty-Sixth IAAI
Conference. Retrieved from http://www.aaai.org/ocs/index.php/IAAI/IAAI14/paper/view/8151
Cohendet, P., & Steinmueller, W. (2000). The Codification of Knowledge: A Conceptual and Empirical Exploration.
Industrial and Corporate Change, 9(2), 195–209.
Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd Edition). Hoboken, NJ: Wiley-Interscience.
Cowan, R., David, P. A., & Foray, D. (2000). The explicit economics of knowledge codification and tacitness. Industrial and
Corporate Change, 9(2), 211–253. http://doi.org/10.1093/icc/9.2.211
Crutchfield, J., & Feldman, D. (2003). Regularities unseen, randomness observed: Levels of entropy convergence. Chaos:
An Interdisciplinary Journal of Nonlinear Science, 13(1), 25–54.
Crutchfield, J. P. (2012). Between order and chaos. Nature Physics, 8(1), 17–24. http://doi.org/10.1038/nphys2190
Gell-Mann, M. (1995). The Quark and the Jaguar: Adventures in the Simple and the Complex. New York: St. Martin’s
Griffin.
Gibbs, J. W. (1873). A Method of Geometrical Representation of the Thermodynamic Properties of Substances by Means
of Surfaces. Transactions of the Connecticut Academy, II, 382–404.
Gleick, J. (2011). The Information: A History, a Theory, a Flood. New York: Pantheon.
Hannak, A., Soeller, G., Lazer, D., Mislove, A., & Wilson, C. (2014). Measuring Price Discrimination and Steering on E-commerce Web Sites. In Proceedings of the 14th ACM/USENIX Internet Measurement Conference (IMC’14).
Vancouver, Canada. Retrieved from http://personalization.ccs.neu.edu/PriceDiscrimination/Research/
Hartley, R. V. L. (1928). Transmission of Information. Bell System Technical Journal, Presented at the International
Congress of Telegraphy and Telephony, Lake Como, Italy, 1927, 535–563.
Hilbert, M. (forthcoming). Bad news first: the digital access divide is here to stay; national bandwidths among 172
countries for 1986 - 2014.
Hilbert, M. (2014a). How much of the global information and communication explosion is driven by more, and how much
by better technology? Journal of the Association for Information Science and Technology, 65(4), 856–861.
http://doi.org/10.1002/asi.23031
Hilbert, M. (2014b). What Is the Content of the World’s Technologically Mediated Information and Communication
Capacity: How Much Text, Image, Audio, and Video? The Information Society, 30(2), 127–143.
http://doi.org/10.1080/01972243.2013.873748
Hilbert, M., & López, P. (2011). The World’s Technological Capacity to Store, Communicate, and Compute Information.
Science, 332(6025), 60 –65. http://doi.org/10.1126/science.1200970
Hilbert, M., & López, P. (2012). How to Measure the World’s Technological Capacity to Communicate, Store and
Compute Information? Part II: measurement unit and conclusions. International Journal of Communication, 6, 936–
955.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., … Kingsbury, B. (2012). Deep Neural Networks for
Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing
Magazine, 29(6), 82–97. http://doi.org/10.1109/MSP.2012.2205597
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science,
313(5786), 504–507. http://doi.org/10.1126/science.1127647
Hodson, H. (2014, July 7). The AI boss that deploys Hong Kong’s subway engineers. New Scientist, (2976). Retrieved from
https://www.newscientist.com/article/mg22329764-000-the-ai-boss-that-deploys-hong-kongs-subway-engineers/
Huffman, D. (1952). A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, 40(9), 1098–
1101. http://doi.org/10.1109/JRPROC.1952.273898
Huffman, K. (2010). Huffman Algorithm. Retrieved November 20, 2015, from http://www.huffmancoding.com/my-uncle
James, R. G., Ellison, C. J., & Crutchfield, J. P. (2011). Anatomy of a bit: Information in a time series observation. Chaos:
An Interdisciplinary Journal of Nonlinear Science, 21(3), 037109. http://doi.org/10.1063/1.3637494
Jaynes, E. T. (1957a). Information Theory and Statistical Mechanics. Physical Review, 106(4), 620.
http://doi.org/10.1103/PhysRev.106.620
Jaynes, E. T. (1957b). Information Theory and Statistical Mechanics. II. Physical Review, 108(2), 171.
http://doi.org/10.1103/PhysRev.108.171
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. (G. L. Bretthorst, Ed.) (1 edition). Cambridge, UK ; New York,
NY: Cambridge University Press.
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255–
260. http://doi.org/10.1126/science.aaa8415
Kahneman, D. (2013). Thinking, Fast and Slow (Reprint edition). Farrar, Straus and Giroux.
Keynes, J. M. (1921). A Treatise on Probability. Macmillan and Company, limited.
Kolmogorov, A. (1941). Interpolation und Extrapolation von stationären zufälligen Folgen. Bulletin of the Academy of Sciences of the USSR, Series on Mathematics, (5), 3–14.
Kolmogorov, A. N. (1968). Three approaches to the quantitative definition of information. International Journal of
Computer Mathematics, 2(1-4), 157–168. http://doi.org/10.1080/00207166808803030
Kuc, R. (1999). The Digital Information Age: An Introduction to Electrical Engineering (1 edition). Boston: Brooks/Cole
Publishing.
Kuhn, T. S. (2012). The Structure of Scientific Revolutions: 50th Anniversary Edition (Fourth Edition edition). University Of
Chicago Press.
Leung-Yan-Cheong, S., & Cover, T. (1978). Some equivalences between Shannon entropy and Kolmogorov complexity.
Information Theory, IEEE Transactions on, 24(3), 331–338.
Li, M., & Vitanyi, P. (1997). An Introduction to Kolmogorov Complexity and Its Applications (2nd ed.). New York: Springer.
López, P., & Hilbert, M. (2012). Methodological and Statistical Background on The World’s Technological Capacity to
Store, Communicate, and Compute Information (online document). Retrieved from
http://www.webcitation.org/6bZqZEVGR.
Martin-Löf, P. (1966). The definition of random sequences. Information and Control, 9(6), 602–619.
http://doi.org/10.1016/S0019-9958(66)80018-9
Massey, J. L. (1998). Applied Digital Information Theory: Lecture Notes by Prof. em. J. L. Massey. Swiss Federal Institute of
Technology. Retrieved from http://www.isiweb.ee.ethz.ch/archive/massey_scr/
Maxwell, J. C. (1872). Theory of heat. Westport, Conn., Greenwood Press. Retrieved from
http://www.archive.org/details/theoryheat02maxwgoog
Minsky, M. (1986). Society Of Mind. Simon and Schuster.
Mitchell, M. (1998). An Introduction to Genetic Algorithms. Cambridge, MA, USA: MIT Press.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … Hassabis, D. (2015). Human-level control
through deep reinforcement learning. Nature, 518(7540), 529–533. http://doi.org/10.1038/nature14236
Moore, G. E. (1995). Lithography and the future of Moore’s law. In Proceedings of SPIE (international society for optics
and photonics) (pp. 2–17). Santa Clara, CA, USA. http://doi.org/10.1117/12.209195
Nelson, R. R., & Winter, S. G. (1985). An Evolutionary Theory of Economic Change. Belknap Press of Harvard University
Press.
Pierce, J. R. (1980). An Introduction to Information Theory (2nd Revised ed.). New York, NY: Dover Publications.
Polanyi, M. (1966). The Tacit Dimension. University of Chicago Press.
Powell, W. W., & DiMaggio, P. J. (1991). The New Institutionalism in Organizational Analysis (1st ed.). University Of
Chicago Press.
Ramchurn, S. D., Vytelingum, P., Rogers, A., & Jennings, N. R. (2012). Putting the “smarts” into the smart grid: a grand
challenge for artificial intelligence. Communications of the ACM, 55(4), 86.
http://doi.org/10.1145/2133806.2133825
Rowley, J. (2007). The wisdom hierarchy: representations of the DIKW hierarchy. Journal of Information Science, 33(2),
163–180. http://doi.org/10.1177/0165551506070706
Russell, S., & Norvig, P. (2009). Artificial Intelligence: A Modern Approach (3rd ed.). Prentice Hall.
Saviotti, P. P. (1998). On the dynamics of appropriability, of tacit and of codified knowledge. Research Policy, 26(7-8),
843–856.
Schaeffer, J., Burch, N., Björnsson, Y., Kishimoto, A., Müller, M., Lake, R., … Sutphen, S. (2007). Checkers Is Solved.
Science, 317(5844), 1518–1522. http://doi.org/10.1126/science.1144079
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.
http://doi.org/10.1016/j.neunet.2014.09.003
Schumpeter, J. (1939). Business Cycles: A Theoretical, Historical, And Statistical Analysis of the Capitalist Process. New
York: McGraw-Hill. Retrieved from
http://classiques.uqac.ca/classiques/Schumpeter_joseph/business_cycles/schumpeter_business_cycles.pdf
Shalizi, C. R., & Crutchfield, J. P. (2001). Computational Mechanics: Pattern and Prediction, Structure and Simplicity.
Journal of Statistical Physics, 104(3-4), 817–879. http://doi.org/10.1023/A:1010388907793
Shannon, C. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27, 379–423, 623–656.
http://doi.org/10.1145/584091.584093
Shannon, C. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal, 30, 50–64.
Siegwart, R., Nourbakhsh, I. R., & Scaramuzza, D. (2011). Introduction to Autonomous Mobile Robots. MIT Press.
Solomonoff, R. J. (1960). A preliminary report on a general theory of inductive inference (Technical Report ZTB-138).
Cambridge, MA: Zator Company.
Solomonoff, R. J. (1964a). A formal theory of inductive inference. Part I. Information and Control, 7(1), 1–22.
http://doi.org/10.1016/S0019-9958(64)90223-2
Solomonoff, R. J. (1964b). A formal theory of inductive inference. Part II. Information and Control, 7(2), 224–254.
http://doi.org/10.1016/S0019-9958(64)90131-7
Szilard, L. (1929). Über die Entropieverminderung in einem thermodynamischen System bei Eingriffen intelligenter
Wesen. Zeitschrift Für Physik A Hadrons and Nuclei, 53(11), 840–856. http://doi.org/10.1007/BF01341281
Tria, F., Loreto, V., Servedio, V. D. P., & Strogatz, S. H. (2014). The dynamics of correlated novelties. Scientific Reports, 4.
http://doi.org/10.1038/srep05890
Tribus, M., & McIrvine, E. C. (1971). Energy and Information. Scientific American, 224, 178–184.
Tversky, A., & Kahneman, D. (1974). Judgment under Uncertainty: Heuristics and Biases. Science, 185(4157), 1124 –1131.
http://doi.org/10.1126/science.185.4157.1124
UCSF (University of California San Francisco). (2011, November 16). UCSF Automated Pharmacy Wins 2011 Popular
Science “Best of What’s New” Award. Retrieved from http://www.ucsf.edu/news/2011/11/10976/ucsfautomated-pharmacy-wins-2011-popular-science-best-whats-new-award
Valiant, L. G. (1984). A Theory of the Learnable. Commun. ACM, 27(11), 1134–1142. http://doi.org/10.1145/1968.1972
Varn, D. P., & Crutchfield, J. P. (2015). Chaotic crystallography: how the physics of information reveals structural order in
materials. Current Opinion in Chemical Engineering, 7, 47–56. http://doi.org/10.1016/j.coche.2014.11.002
Varn, D. P., & Crutchfield, J. P. (2015). What did Erwin Mean? The Physics of Information from the Materials Genomics of
Aperiodic Crystals and Water to Molecular Information Catalysts and Life. arXiv:1510.02778 [cond-mat, physics:nlin, q-bio]. Retrieved from http://arxiv.org/abs/1510.02778
Weitzman, M. L. (1998). Recombinant Growth. The Quarterly Journal of Economics, 113(2), 331–360.
Wheeler, J. (1990). Information, Physics, Quantum: The search for Links. In W. H. Zurek (Ed.), Complexity, Entropy and
the Physics of Information (pp. 3–28). Oxford: Westview Press.
Wiener, N. (1948). Cybernetics; or, Control and communication in the animal and the machine. J. Wiley.
Wu, Y., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made
by humans. Proceedings of the National Academy of Sciences, 201418680.
http://doi.org/10.1073/pnas.1418680112
Yegnanarayana, B. (2004). Artificial Neural Networks. New Delhi: Prentice-Hall of India Pvt.Ltd.
Yeung, R. W. (1991). A new outlook on Shannon’s information measures. IEEE Transactions on Information Theory, 37(3),
466–474. http://doi.org/10.1109/18.79902
Youn, H., Strumsky, D., Bettencourt, L. M. A., & Lobo, J. (2015). Invention as a combinatorial process: evidence from US
patents. Journal of The Royal Society Interface, 12(106), 20150272. http://doi.org/10.1098/rsif.2015.0272
Zeleny, M. (1986). Management Support Systems: Towards Integrated Knowledge Management. Human Systems
Management, 7(1), 59–70.
Zurek, W. H. (1989). Algorithmic randomness and physical entropy. Physical Review A, 40(8), 4731.
http://doi.org/10.1103/PhysRevA.40.4731
Zurek, W. H. (1998). Algorithmic randomness, physical entropy, measurements, and the Demon of Choice. arXiv:quant-ph/9807007. Retrieved from http://arxiv.org/abs/quant-ph/9807007
Zvonkin, A. K., & Levin, L. A. (1970). The complexity of finite objects and the development of the concepts of information
and randomness by means of the theory of algorithms. Russian Mathematics Surveys (Uspekhi Mat. Nauk), 25(6),
83–124.