Coding Theory
Emmanuel Abbe
1 Introduction
The field of coding theory emerged with the pioneering work of Claude E. Shannon “A Mathematical
Theory of Communication”, published in 1948 in the Bell System Technical Journal. In his work,
Shannon investigates both the transmission and compression of information. A probabilistic model
is proposed for the communication channel or for the source, and a mathematical measure is
proposed to quantify the amount of noise in the channel (capacity) or the amount of redundancy in
the source (entropy). In each setting, the unit used to measure the amount of noise or redundancy
is the bit (binary digit), a terminology attributed to John W. Tukey. The notion of code is also
developed to protect information from the noise or to compress redundant information. Shannon
then provides two monumental results: (1) the largest rate at which information can be reliably
transmitted is given by the channel capacity, (2) the largest rate at which information can be reliably
compressed is given by the source entropy. The notion of reliability requires formal definitions,
discussed below, but in words refers to essentially not losing information with high probability.
These results are particularly powerful since they indicate at the same time what can be achieved
with communication and compression protocols, and what cannot be achieved irrespective of the
algorithmic approach. This type of result is known as “information theoretic”.
Shannon did, however, leave a major open problem with his theory. While he proves that there
exist codes achieving rates up to the capacity, or down to the entropy, Shannon does not provide
a construction of such codes. This is one of the first applications of the “probabilistic method”,
pioneered by Paul Erdős around the same time. This also gave birth to coding theory, whose major
goal has been to fill in this gap, by constructing good codes with manageable complexities. By
now, coding theory has also expanded into a field of its own, with a broad spectrum of applications
in electrical engineering, computer science and discrete mathematics.
Shortly after Shannon’s seminal paper, Richard Hamming published in 1950 “Error Detecting
and Error Correcting Codes”, also in the Bell System Technical Journal, which contains some of
the fundamental concepts of coding theory. Hamming focused not on the probabilistic model of
Shannon, but on a worst-case noise model. More specifically, Hamming developed coding schemes
which can correct a fixed number of errors with probability one, as opposed to Shannon, who
contented himself with correcting errors with high probability over the noise realizations.
The goal of this note is to provide a few examples of fundamental coding theory results, in both
the source and channel setting, and in both the worst-case and probabilistic setting.
2 Some teasers
Erasure coding. You would like to encode music on a CD which involves only four notes: A,B,C,D.
One possibility is to encode A with 00, B with 01, C with 10 and D with 11. However, this encoding
is not robust to erasures: for example, observing “0?” does not allow the CD player to decide between
A and B. A single scratch would hence spoil the CD. One possibility to prevent this is to triple each
bit, using the four codewords 000000, 000111, 111000, 111111. This allows any two erasures to be corrected.
With codewords of length 6, is there a smarter way to choose the binary sequence for each note and
correct up to three erasures (irrespective of the erasures’ locations)?
The hat puzzle. Three players enter a room and a red or blue hat is placed on each person’s
head. The color of each hat is determined by a coin toss, with the outcome of one coin toss
having no effect on the others. Each person can see the other players’ hats but not his own. No
communication of any sort is allowed, except for an initial strategy session before the game begins.
Once they have had a chance to look at the other hats, the players must simultaneously guess the
color of their own hats or pass. The group shares a hypothetical 3 million prize if at least one
player guesses correctly and no players guess incorrectly. The problem is to find a strategy for the
group that maximizes its chances of winning the prize. One obvious strategy for the players, for
instance, would be for one player to always guess “red” while the other players pass. This would
give the group a 50 percent chance of winning the prize. Can the group do better?
You can read more about this puzzle online.
3 Source coding: unstructured codes
3.1 Probabilistic model
Definitions. F_2 = {0, 1}, and a Bernoulli(p) random variable takes values in F_2 with probability
p of being 1. Let X^n = (X_1, . . . , X_n) be a random source, taking values in F_2^n. Assume that you
know the distribution of X^n.
For example, this might be a weather sequence, with n = 365, X_k = 1 if it rains on day k, and X_k = 0
otherwise. In more realistic applications the source might represent the pixels of an image, the
letters of an English text, a PDF file, etc. You would like to store the weather pattern of a year.
One option is to simply store the sequence X^n, which requires 365 bits. Depending on where you
live, you may want to exploit the fact that the X_k's take the same value most of the time, e.g., in
Death Valley.
Source code. An almost lossless source code is a map C_n : F_2^n → F_2^m such that C_n(X^n) allows
X^n to be recovered with high probability. Formally, there exists a decoding map D_n : F_2^m → F_2^n such that

    P{D_n(C_n(X^n)) ≠ X^n} → 0,  as n → ∞.    (1)
Call m the dimension of the source code.
Example. Assume that X_1 = 0 with probability 1/2, and X_k = X_1 for all k ≥ 1. Then a good
source code is to store X_1 only. This indeed gives a zero-error-probability code which requires only
one bit to be stored.
The source model. Assume now that X^n is i.i.d. Bernoulli(p), n ≥ 1, p ∈ [0, 1].
Problem: For what r ∈ [0, 1] can we find a sequence of source codes C_n of dimension m = m(n)
such that m(n)/n → r? In words, how small a compression rate can we achieve?
If r = 1, there is an easy solution: pick m(n) = n and let both C_n and D_n be the identity map.
We call r an achievable rate if the above problem can be solved, and denote by
r^*(p) the infimum of the achievable rates.
Theorem 1 (Shannon ’48). For any p ∈ [0, 1],

    r^*(p) = −p log_2(p) − (1 − p) log_2(1 − p).

The above function is called the entropy function; we denote it by H(p).
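For a concrete feel of the theorem, here is a minimal sketch (not part of the original notes; the value p = 0.05 is an arbitrary stand-in for a dry climate) that evaluates H(p) and the resulting compressed length for the 365-day weather example above.

```python
# Minimal sketch: binary entropy H(p) and compressed length for the weather example.
# The probability p below is an assumed, illustrative value.
import math

def binary_entropy(p: float) -> float:
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.05          # hypothetical probability of rain on a given day
n = 365           # length of the weather sequence
print(binary_entropy(p))        # ~0.286 bits per day
print(n * binary_entropy(p))    # ~104.5 bits for the year, instead of 365
```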
Proof of the achievability. Define the set T_{ε,n}(p) = {x^n ∈ F_2^n : (1/n) Σ_{i=1}^{n} x_i ∈ [p − ε, p + ε]}. By the
law of large numbers,

    P{X^n ∉ T_{ε,n}(p)} = P{(1/n) Σ_{i=1}^{n} X_i ∉ [p − ε, p + ε]} → 0.    (2)

Hence, it is sufficient to encode all sequences of T_{ε,n}(p) and to discard the other sequences. Note that

    \binom{n}{⌈np⌉} = 2^{nH(p)+o(n)}.    (3)

Hence,

    |T_{ε,n}(p)| ≤ 2^{n(H(p)+δ(ε))+o(n)},    (4)

with δ(ε) → 0 when ε → 0. Hence, we can encode all the sequences of T_{ε,n}(p) with roughly
n(H(p) + δ(ε)) bits, and encode all other sequences to the all-zero sequence, which induces errors but
only with vanishing probability.
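The proof can be turned into a toy encoder. The sketch below (illustrative only; the block length n = 20 and the parameters p, ε are arbitrary choices, not from the notes) enumerates T_{ε,n}(p), indexes its elements with ⌈log_2 |T_{ε,n}(p)|⌉ ≈ n(H(p) + δ(ε)) bits, and maps atypical sequences to index 0, as in the argument above. At such a small n the error probability is still visible; it vanishes only as n grows.

```python
# Toy typical-set encoder with small, illustrative parameters.
import itertools, math, random

n, p, eps = 20, 0.2, 0.1

# T_{eps,n}(p): sequences whose fraction of ones lies in [p - eps, p + eps].
typical = []
for k in range(n + 1):
    if abs(k / n - p) <= eps:
        for support in itertools.combinations(range(n), k):
            x = [0] * n
            for i in support:
                x[i] = 1
            typical.append(tuple(x))

index = {x: i for i, x in enumerate(typical)}
m = math.ceil(math.log2(len(typical)))   # ~ n(H(p) + delta(eps)) bits

def encode(x):
    # Atypical sequences are all sent to index 0: this is where errors occur.
    return index.get(tuple(x), 0)

def decode(i):
    return list(typical[i])

# Empirical error probability for an i.i.d. Bernoulli(p) source.
random.seed(0)
trials, errors = 2000, 0
for _ in range(trials):
    x = [1 if random.random() < p else 0 for _ in range(n)]
    errors += decode(encode(x)) != x
print(m, "bits instead of", n, "| empirical error rate:", errors / trials)
```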
One can show that the above theorem still holds for a source code which is linear, i.e., satisfying
C(x+y) = C(x)+C(y). Elias showed this in 1958. The linearity property is relevant for complexity
considerations and for applications to channel coding. We will investigate this in the next sections.
3.2 Worst-case model
Linear model compression. Given a set M ⊆ F_2^n, called the model, find a matrix A ∈ F_2^{m×n}
such that no two elements of M map to the same image. Equivalently, for any x ∈ M, the only
element of M mapping to Ax is x.
Definition 1.
(1) A is called M-invertible if it solves the above problem.
(2) The compression dimension of M is defined by

    ∆(M) = min_{A M-invertible} rank(A).    (5)
Fact: A is M-invertible if and only if

    {M + M} ∩ ker(A) = {0}.    (6)

Note that M + M may not be closed under addition, hence embedding M into a vector space V and
choosing ker(A) = V^⊥ may not be optimal.
Definition 2. The Hamming ball is defined by

    B_n(k) = {x ∈ F_2^n : w(x) ≤ k},    (7)

where w(x) denotes the Hamming weight of x.
Corollary 1. Let p = k/n. For p ≥ 1/2, ∆(B_n(k)) = n + o(n), and for p < 1/2,

    nH(p) + o(n) ≤ ∆(B_n(k)) ≤ nH(2p) + o(n).    (8)
Proof. The lower bound follows from the injectivity of A on M, which is required even of
non-linear maps, and the fact that

    |B_n(k)| = Σ_{j=0}^{k} \binom{n}{j} = 2^{nH(min(p,1/2))+o(n)}.    (9)
For the upper bound, consider A of dimension m with i.i.d. Bernoulli(1/2) entries. The probability
that A is M-invertible is given by

    P{∀ x, x′ ∈ M, x ≠ x′ : Ax ≠ Ax′} = P{∀ v ∈ M + M \ {0} : Av ≠ 0}    (10)
                                      = 1 − P{∃ v ∈ M + M \ {0} : Av = 0}    (11)
                                      ≥ 1 − min((|M + M| − 1) 2^{−m}, 1).    (12)
Hence, if m ≥ ⌊log_2(|M + M| − 1)⌋ + 1, the above is strictly positive. The result then follows from
the fact that

    B_n(k) + B_n(k) = B_n(2k).    (13)
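The random-matrix argument can be checked numerically on a tiny instance. The sketch below (all parameters are small illustrative choices, not from the notes) draws A with i.i.d. Bernoulli(1/2) entries and tests B_n(k)-invertibility via the Fact above: it suffices that no nonzero vector of B_n(2k) = B_n(k) + B_n(k) lies in ker(A).

```python
# Random linear compression of the Hamming ball B_n(k) (toy parameters).
# Invertibility is tested via the criterion {M + M} ∩ ker(A) = {0}, i.e.
# no nonzero vector of weight <= 2k may lie in the kernel of A.
import itertools, random

n, k, m = 10, 1, 8   # compress B_10(1) with a random 8 x 10 matrix

def rand_matrix(rows, cols):
    return [[random.randint(0, 1) for _ in range(cols)] for _ in range(rows)]

def apply(A, x):
    # Matrix-vector product over F_2.
    return [sum(a * b for a, b in zip(row, x)) % 2 for row in A]

def is_ball_invertible(A):
    for w in range(1, 2 * k + 1):                        # enumerate B_n(2k) \ {0}
        for support in itertools.combinations(range(n), w):
            v = [1 if i in support else 0 for i in range(n)]
            if not any(apply(A, v)):
                return False
    return True

random.seed(0)
trials = 200
success = sum(is_ball_invertible(rand_matrix(m, n)) for _ in range(trials))
# Here log2 |B_10(2)| ~ 5.8, so m = 8 leaves a few bits of slack and a clear
# majority of random draws succeed; the success probability tends to 1 as the
# slack grows, matching the union bound (12).
print(success, "of", trials, "random matrices are B_n(k)-invertible")
```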
4 Source compression: structured codes
4.1 Worst-case model: Reed-Muller codes
Definition 3. Let l ≥ 0, n = 2^l,

    A_l = [[1, 1], [0, 1]]^{⊗l},    (14)

let A_l^{(i)} denote the i-th row of A_l, and for t ∈ {0, 1, . . . , l},

    A_l(t) = {A_l^{(i)} : w(A_l^{(i)}) ≥ 2^t},    (15)
    d_l(t) = min_{x,y ∈ ⟨A_l(t)⟩, x ≠ y} d(x, y).    (16)

In words, A_l(t) is the submatrix of A_l whose rows have weight at least 2^t, and d_l(t) is the minimum
distance of the span of these rows.
We next present two key properties of Al (t).
Lemma 1. For l ≥ 0, t ∈ {0, 1, . . . , l}, d_l(t) = 2^t.
Lemma 2. For l ≥ 0, t ∈ {0, 1, . . . , l}, ker(A_l(t)) = ⟨A_l(l − t + 1)⟩.
This means that the kernel of A_l(t) is given by the span of the rows of A_l(l − t + 1).
Proposition 1. Let l ≥ 0, n = 2^l, and t ∈ {0, 1, . . . , l}. The matrix A_l(t) is B_n(k)-invertible if
and only if

    k/n < 2^{−t}.    (17)
Corollary 2. For k a power of 2, the least dimension of a B_n(k)-invertible A_l(t) is achieved
by t = log_2(n/k) − 1 and equals

    n − Σ_{j=0}^{log_2(n/k)−2} \binom{l}{j}.    (18)

This means that to get a constant dimensionality-reduction factor, we need k = o(√n).
Proof of Proposition 1. Note that

    max_{v ∈ B_n(k)+B_n(k)} w(v) = 2k,    (19)

and from Lemma 1 and Lemma 2,

    min_{v ∈ ker(A_l(t)) \ {0}} w(v) = 2^{l−t+1} = 2n 2^{−t}.    (20)

If 2k < 2n 2^{−t}, then {B_n(k) + B_n(k)} ∩ ker(A_l(t)) = {0} and A_l(t) is B_n(k)-invertible. If 2k ≥ 2n 2^{−t},
pick an element of ker(A_l(t)) of weight 2n 2^{−t} and express it as the sum of two elements of B_n(k);
then A_l(t) maps these two elements to the same image.
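These statements can be verified by brute force for a small l. The sketch below (illustrative, l = 3, not from the notes) builds A_l as the l-fold Kronecker power of [[1,1],[0,1]], forms A_l(t) by keeping the rows of weight at least 2^t, and checks both the minimum distance of Lemma 1 and the row count n − Σ_{j=0}^{t−1} \binom{l}{j} underlying Corollary 2.

```python
# Brute-force check of Lemma 1 and the row count behind Corollary 2 for l = 3.
import itertools
from math import comb

def kron(F, A):
    # Kronecker product of two 0/1 matrices.
    return [[f * a for f in Frow for a in Arow] for Frow in F for Arow in A]

def min_weight(rows):
    # Minimum Hamming weight of a nonzero codeword in the span of `rows`,
    # which equals the minimum distance of the generated linear code.
    best = len(rows[0])
    for coeffs in itertools.product((0, 1), repeat=len(rows)):
        if any(coeffs):
            cw = [sum(c * r[j] for c, r in zip(coeffs, rows)) % 2
                  for j in range(len(rows[0]))]
            best = min(best, sum(cw))
    return best

l = 3
n = 2 ** l
A = [[1]]
for _ in range(l):
    A = kron([[1, 1], [0, 1]], A)     # A_l = [[1,1],[0,1]]^{(x) l}

for t in range(l + 1):
    rows_t = [row for row in A if sum(row) >= 2 ** t]        # A_l(t)
    n_rows = n - sum(comb(l, j) for j in range(t))           # count from Corollary 2
    print("t =", t,
          "| row count matches:", len(rows_t) == n_rows,
          "| min distance = 2^t:", min_weight(rows_t) == 2 ** t)
```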
4.2 Probabilistic model: Polar codes
We now consider matrices A_n whose structure lies not in sparsity (they are in fact dense) but in
their recursive form. Let

    G_n = [[1, 1], [0, 1]]^{⊗k},    n = 2^k,  k ≥ 1.

The pattern of this matrix corresponds to Sierpiński's triangle. We consider matrices obtained
by selecting a subset of the rows of G_n, denoting by G_n[S] the matrix obtained by keeping the rows
of G_n indexed by S ⊆ [n].
Let X^n be i.i.d. Bernoulli(p) and set Y^n = G_n X^n. Note that, since G_n is invertible,

    nH(p) = H(X^n) = H(Y^n) = Σ_{i=1}^{n} H(Y_i | Y^{i−1}).    (21)
An interesting feature of Gn is provided by the following “polarization phenomenon”.
Theorem 2 (Arıkan ’08). For any ε > 0,

    |{i ∈ [n] : H(Y_i | Y^{i−1}) ∈ (ε, 1 − ε)}| = o(n),    (22)

and the same holds if ε = ε(n) = 2^{−n^{0.49}}.
Hence, defining

    R_{ε,n}(p) := {i ∈ [n] : H(Y_i | Y^{i−1}) ≥ ε},    (23)

we obtain the following.

Lemma 3. Let ε = ε(n) = 2^{−n^{0.49}}. Then A_{ε,n} = G_n[R_{ε,n}] is a linear and almost lossless source
code, and

    rank(A_{ε,n}) = nH(p) + o(n).    (24)
Moreover, the main attribute of the polar code matrix is that the successive MAP decoding of
the bits in R^c_{ε,n} can be done in O(n log n). The only caveat is that the characterization of R_{ε,n} is not
well understood (it shows an interesting fractal structure), and one has to rely on approximation
algorithms to estimate that set.
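The recursive structure also gives the O(n log n) transform directly. Below is a minimal sketch (under the convention G_n = [[1,1],[0,1]]^{⊗k}, n = 2^k, used above) that computes x ↦ G_n x over F_2 by splitting x into halves, and checks the result against the dense Kronecker matrix for a small n; the successive decoding mentioned above relies on the same recursion.

```python
# Minimal sketch: y = G_n x over F_2 in O(n log n), using G_n = F ⊗ G_{n/2}
# with F = [[1,1],[0,1]].  The dense Kronecker build is only a correctness check.
import random

def polar_transform(x):
    """Apply G_n (n = len(x), a power of 2) to x over F_2, recursively."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    top, bot = x[:half], x[half:]
    # [[G, G], [0, G]] [top; bot] = [G(top + bot); G(bot)]  (mod 2)
    return polar_transform([a ^ b for a, b in zip(top, bot)]) + polar_transform(bot)

def kron(F, A):
    return [[f * a for f in Frow for a in Arow] for Frow in F for Arow in A]

k = 4
n = 2 ** k
G = [[1]]
for _ in range(k):
    G = kron([[1, 1], [0, 1]], G)

random.seed(0)
x = [random.randint(0, 1) for _ in range(n)]
dense = [sum(g * xi for g, xi in zip(row, x)) % 2 for row in G]
assert polar_transform(x) == dense
print("recursive and dense transforms agree on n =", n)
```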
5 Channel coding
We only discuss here the case of the binary symmetric channel (BSC). Consider the following
communication protocol. Alice wishes to transmit information to Bob. The message set is assumed
to be {1, . . . , M } (this can represent an indexing of the real messages, such as the keyboard letters).
Each message m is encoded into a binary string x^n(m) of length n. This collection is the codebook, which is
accessible to both Alice and Bob. To transmit a message m encoded by a sequence x^n(m), Alice
feeds the sequence to the BSC(p) channel, which produces a random output Y^n = x^n(m) + Z^n,
where Z^n is i.i.d. Bernoulli(p). Upon observing the output Y^n, Bob declares a message m̂ = D(Y^n).
The goal is to construct a codebook and a decoding map with vanishing error probability as n tends
to infinity, no matter which message is sent; this defines reliable communication. The rate of the
code is given by log_2(M)/n, and a rate r is achievable if there exists a codebook and a decoding
map which achieve reliable communication with log_2(M)/n → r. The supremum of the achievable
rates is called the capacity of the BSC.
Theorem 3 (Shannon ’48). The capacity of the BSC(p) is given by 1 − H(p).
Proof of achievability. Consider applying a linear almost lossless source encoder A_n to the output
Y^n = x^n + Z^n, to obtain A_n Y^n = A_n x^n + A_n Z^n. If x^n is in the kernel of A_n, then A_n Y^n = A_n Z^n and we
can recover Z^n with high probability, hence x^n too. The dimension of A_n is nH(p) + o(n), hence
by the rank-nullity theorem, the dimension of the kernel of A_n is n(1 − H(p)) + o(n), which proves the
claim.
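As a toy illustration of this reduction (not the construction in the notes), one can take any parity-check matrix A_n, use its kernel as the codebook, and have the receiver recover the noise from the syndrome A_n Y^n = A_n Z^n. The sketch below does this by brute force with the [7,4] Hamming code's parity-check matrix; finding the lowest-weight noise pattern consistent with the syndrome is the maximum-likelihood rule for a BSC with p < 1/2.

```python
# Toy channel code on the BSC: codebook = ker(A), syndrome decoding by coset leaders.
import itertools, random

# Parity-check matrix of the [7,4] Hamming code (chosen here just for illustration).
A = [[1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]
n = 7

def syndrome(y):
    return tuple(sum(a * b for a, b in zip(row, y)) % 2 for row in A)

# Lowest-weight representative of each syndrome (coset leader), found by brute force.
leader = {}
for w in range(n + 1):
    for support in itertools.combinations(range(n), w):
        e = [1 if i in support else 0 for i in range(n)]
        leader.setdefault(syndrome(e), e)

codebook = [c for c in itertools.product((0, 1), repeat=n)
            if syndrome(c) == (0, 0, 0)]          # 2^{7-3} = 16 codewords

random.seed(0)
p, trials, errors = 0.05, 5000, 0
for _ in range(trials):
    x = list(random.choice(codebook))                         # transmitted codeword
    z = [1 if random.random() < p else 0 for _ in range(n)]   # BSC noise
    y = [(a + b) % 2 for a, b in zip(x, z)]
    z_hat = leader[syndrome(y)]                               # estimate the noise
    x_hat = [(a + b) % 2 for a, b in zip(y, z_hat)]
    errors += x_hat != x
print("rate:", 4 / 7, "| empirical block error rate:", errors / trials)
```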
We did not discuss converse results, i.e., showing that rates above capacity cannot be achieved.
This relies on Fano's inequality. Note that the above proof technique can also be adapted to obtain
channel codes from source codes in the worst-case model.