
642:612,198:672 Topics in Error Correcting Codes
Spring 2009
Lecture 3: Shannon’s theorem and some upper and lower bounds
Jan 27, 2009
Lecturer: Adi Akavia
Scribe: Darakhshan J Mir

1 Shannon’s Communication Model
Shannon’s communication model consists of a source that generates digital information, which is sent to
the receiver over a communication channel. The channel may introduce errors or it may be noiseless. The
model incorporates both variants:
1. Noisy Channel: This channel introduces noise into the data during transmission, so the receiver
gets a noisy version of the transmitted data. Therefore, for reliable communication, redundancy has to
be added to the data generated by the source. This redundancy is then removed at the receiver’s end.
This process of introducing redundancy into the data to make it more resilient to noise is called Channel
Coding.
2. Noiseless Channel: This channel does not introduce any error in the transmission. The redundancy
of the source data is exploited by compressing the source data and then sending it over the channel.
The data at the receiver’s end is then decompressed. This process is called Source Coding.
Source coding and channel coding can be combined into a more general coding framework; however, Shannon,
through his Source Coding theorem and Channel Coding theorem, decoupled these two aspects of communication
so that they can be studied separately.
To model noise, Shannon proposed a stochastic model of the communication channel. The input to the
channel is assumed to come from an input alphabet X and the output from some output alphabet Y. The
process of transmission is specified as a transition matrix of probabilities of the following kind:
For any pair (x, y) ∈ X × Y, if Pr(y|x) denotes the probability that the output of the channel is y when
the input is x, then the transition matrix is given by:
M(x, y) = Pr(y|x)
We look at some specific kinds of channels:
1.1 Binary Symmetric Channel (BSC)
The Binary Symmetric Channel with crossover probability p, 0 ≤ p ≤ 1/2, is modelled as a 2 × 2 transition
matrix, for which X = {0, 1}, Y = {0, 1}, and:
Pr(0|0) = Pr(1|1) = 1 − p
and
Pr(0|1) = Pr(1|0) = p
A bit is therefore flipped with probability p.
1.2 Binary Erasure Channel (BEC)
For the BEC with erasure probability p, 0 ≤ p ≤ 1/2, X = {0, 1} and Y = {0, 1, ?}, where ? denotes an
erasure. The transition matrix is:
Pr(?|0) = Pr(?|1) = p
and
Pr(0|0) = Pr(1|1) = 1 − p
1.3 Noisy Typewriter Channel
In the noisy typewriter channel X = Y = Σ, where |Σ| = n. If we fix an order on the elements σi ∈ Σ, then
Pr(σi | σi) = 1/2, ∀ i ∈ [n]
and
Pr(σ(i+1) mod n | σi) = 1/2, ∀ i ∈ [n]
In other words, a symbol has a probability of 1/2 of being corrupted into the next symbol in the ordering.
We are interested in reducing the error introduced by the channel during communication by as much as
possible. Ideally we would like to have zero-error communication; however, with a BSC this is not possible
while maintaining a positive rate, because there is a positive probability with which a bit will be flipped.
However, we may reduce this probability of error arbitrarily by using a very long repetition code.
The flip side is that as the probability of error is reduced arbitrarily, the rate tends to zero. A fundamental
question that Shannon asked was whether we can achieve any desired probability of error while maintaining
a positive rate. Shannon’s theorem for the BSC answers this question in the affirmative and characterizes the
largest possible rate.
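To make this trade-off concrete, here is a minimal simulation sketch (Python; the crossover probability, repetition lengths and trial count are arbitrary choices of the editor, not part of the notes' argument): each bit is sent as r copies over a BSC(p) and decoded by majority vote. As r grows, the empirical error probability falls, while the rate 1/r tends to zero.

import random

def bsc(bits, p):
    """Flip each bit independently with probability p (BSC with crossover probability p)."""
    return [b ^ (random.random() < p) for b in bits]

def repetition_error_rate(r, p, trials=10000):
    """Empirical probability that majority decoding of an r-fold repetition of a bit fails."""
    errors = 0
    for _ in range(trials):
        received = bsc([0] * r, p)       # by symmetry it suffices to send the bit 0
        if sum(received) > r // 2:       # majority vote decodes to 1, i.e. an error
            errors += 1
    return errors / trials

if __name__ == "__main__":
    p = 0.1
    for r in [1, 3, 5, 9, 17]:
        print(f"r={r:2d}  rate={1/r:.3f}  empirical error={repetition_error_rate(r, p):.4f}")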
2 Shannon’s Theorem for the Binary Symmetric Channel (BSC)
Theorem 2.1.
1. If R < 1 − H(p), then there exist a real ε > 0, an encoding function En : {0, 1}^k → {0, 1}^n and
a decoding function Dn : {0, 1}^n → {0, 1}^k such that the following holds for all m ∈ {0, 1}^k:
Pr_{noise η from BSC}[Dn(En(m) + η) ≠ m] ≤ 2^{−εn}
2. If R > 1 − H(p), then for n sufficiently large, and for every pair of encoding and decoding functions
En : {0, 1}^k → {0, 1}^n and Dn : {0, 1}^n → {0, 1}^k:
Pr[decoding error] ≥ 1 − exp(−n)
We’ll use the following fact in the proof of the theorem.
Lemma 2.2 (A tool: Chernoff bound). Let Y1, Y2, . . . , Yn be i.i.d. binary random variables such that
Pr[Yi = 1] = p and Pr[Yi = 0] = 1 − p. Then we know that E(Yi) = p and E((1/n) Σ_{i=1}^n Yi) = p.
Furthermore:
Pr[ |(1/n) Σ_{i=1}^n Yi − E((1/n) Σ_{i=1}^n Yi)| > ε ] ≤ 2^{−ε²n/3}
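For a quick empirical look at this concentration, the following Python sketch (parameters chosen arbitrarily by the editor) estimates the tail probability by sampling and prints it next to the bound 2^{−ε²n/3} stated in the lemma.

import random

def empirical_tail(n, p, eps, trials=5000):
    """Estimate Pr[ |(1/n) * sum(Y_i) - p| > eps ] for i.i.d. Bernoulli(p) bits Y_i."""
    hits = 0
    for _ in range(trials):
        avg = sum(random.random() < p for _ in range(n)) / n
        if abs(avg - p) > eps:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    p, eps = 0.1, 0.1
    for n in [50, 200, 800]:
        bound = 2 ** (-(eps ** 2) * n / 3)   # the bound stated in Lemma 2.2
        print(f"n={n:3d}  empirical tail={empirical_tail(n, p, eps):.4f}  bound={bound:.4f}")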
Proof of part (1). Let En : {0, 1}^k → {0, 1}^n be a random encoding that maps every message m ∈ {0, 1}^k,
where k = Rn, to a uniformly random codeword En(m) ∈ {0, 1}^n.
The decoding function Dn : {0, 1}^n → {0, 1}^k is defined as:
Dn(y) = arg min_{m ∈ {0,1}^k} ∆(y, En(m))
where y is the received word.
In other words, the decoding function is a maximum likelihood decoding (MLD) function, choosing the
codeword that is closest to the received word.
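For intuition about this decoder, here is a toy Python sketch (the values of k, n, p and the trial count are arbitrary editor's choices; it only illustrates random encoding with nearest-codeword decoding, and the brute-force search over all 2^k messages makes the exponential cost of MLD visible).

import random

def hamming(x, y):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(x, y))

def random_code(k, n):
    """Map each of the 2^k messages to a uniformly random codeword in {0,1}^n."""
    return {m: tuple(random.randint(0, 1) for _ in range(n)) for m in range(2 ** k)}

def mld_decode(y, code):
    """Nearest-codeword (maximum likelihood for the BSC) decoding by brute force."""
    return min(code, key=lambda m: hamming(y, code[m]))

if __name__ == "__main__":
    k, n, p, trials = 4, 40, 0.1, 2000
    code = random_code(k, n)
    errors = 0
    for _ in range(trials):
        m = random.randrange(2 ** k)
        y = tuple(b ^ (random.random() < p) for b in code[m])   # BSC(p) noise
        errors += (mld_decode(y, code) != m)
    print(f"rate = {k/n:.2f}, empirical decoding error = {errors / trials:.4f}")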
Fix a message m and an encoding function En, and let y = En(m) + η, where η is the noise introduced by the
BSC. Let p(y|En(m)) denote the probability (over the channel noise) that the received word is y when
the transmitted codeword was En(m), let B(x, r) denote the Hamming ball of radius r around the vector x, and
let 1(·) denote the indicator function. The proof comprises two steps:
1. First we prove that for a fixed message m the average probability of error (over the choice of En) is
exponentially small. By the probabilistic method, this implies the existence of a code with a very low
probability of decoding error for the message m.
2. Then we prove that the result holds for every message: we remove some “undesirable”
codewords from the code and show that this barely affects the rate.
We have:
Pr_{BSC noise}[Dn(En(m) + noise) ≠ m]
  = Σ_{y ∈ {0,1}^n} p(y|En(m)) · 1_{Dn(y)≠m}
  ≤ Σ_{y ∉ B(En(m), (p+ε/2)n)} p(y|En(m)) + Σ_{y ∈ B(En(m), (p+ε/2)n)} p(y|En(m)) · 1_{Dn(y)≠m}
  ≤ 2^{−(ε/2)²·n/2} + Σ_{y ∈ B(En(m), (p+ε/2)n)} p(y|En(m)) · 1_{Dn(y)≠m}
The last inequality follows from the Chernoff bound. Now, taking expectations on both sides (over the
choice of En) and writing η for the noise vector, we have:
E[Pr_noise[Dn(En(m) + noise) ≠ m]] ≤ 2^{−(ε/2)²·n/2} + Σ_{η ∈ B(0, (p+ε/2)n)} Pr[noise = η] · E[1_{Dn(η+En(m))≠m}]    (1)
We have:
E[1_{Dn(η+En(m))≠m}]
  ≤ Pr_{choice of En}[∃ m′ ≠ m such that ∆(η + En(m), En(m′)) ≤ (p + ε/2)n]
  ≤ Σ_{m′ ≠ m} Pr[∆(En(m) + η, En(m′)) ≤ (p + ε/2)n]
  ≤ (Vol(0, (p + ε/2)n) / 2^n) · 2^k    (2)
  ≤ (2^{H(p+ε/2)n} / 2^n) · 2^k    (3)
Here the third inequality follows from the fact that En(m′) is chosen uniformly at random. Inequality
(3) follows from the volume bound. Using (3) in (1) we have:
E[Pr_noise[Dn(En(m) + noise) ≠ m]]
  ≤ 2^{−(ε/2)²·n/2} + Σ_{η ∈ B(0, (p+ε/2)n)} Pr[noise = η] · 2^{k+(H(p+ε/2)−1)n}
  ≤ 2^{−(ε/2)²·n/2} + 2^{k−(1−H(p+ε/2))n}
  = 2^{−ε²n/8} + 2^{−(H(p+ε)−H(p+ε/2))n}
  ≤ 2^{−δ′n}    (4)
The second inequality follows from the fact that Σ_{η ∈ B(0, (p+ε/2)n)} Pr[noise = η] ≤ 1, the equality uses
k = (1 − H(p + ε))n, and inequality (4) follows by setting δ′ appropriately.
Since (4) holds for every message, we have:
E_m E_{choice of En}[Pr_noise[Dn(En(m) + noise) ≠ m]] ≤ 2^{−δ′n}
Changing the order of expectations, we infer that there exists some encoding function En* such that:
E_m[Pr[Dn*(En*(m) + noise) ≠ m]] ≤ 2^{−δ′n}    (5)
We want the above to hold for every message m. We identify “undesirable” codewords in the following
manner. Sort the messages m in ascending order of their decoding error probabilities. We make the following
claim:
Claim 2.3. For the first 2^{k−1} messages in the ascending order specified above, the decoding error probabilities
are at most 2 · 2^{−δ′n}.
Proof of Claim. Assume that the claim is false. Then, since the messages are sorted in ascending order, the last
2^{k−1} messages all have error probability strictly greater than 2 · 2^{−δ′n}, so the average error probability over
all messages exceeds 2^{−δ′n}, which contradicts (5).
Thus we throw away the last 2^{k−1} messages to get an encoding function E′n, which has an error of at most
2 · 2^{−δ′n} = 2^{−δn}, where δ = δ′ − 1/n. Also, the rate of E′n is smaller than that of En* by only 1/n, which is
negligible in comparison to 1 − H(p + ε), and hence essentially the same rate is maintained.
Proof of part (2). Part (2) of Shannon’s theorem in particular implies that, under the given conditions, for some
m ∈ {0, 1}^k:
Pr_{noise η of BSC}[Dn(En(m) + η) ≠ m] ≥ 1/2
We prove this weaker statement. Assume, for contradiction, that for every m ∈ {0, 1}^k:
Pr_{noise η of BSC}[Dn(En(m) + η) ≠ m] ≤ 1/2
Fix an m ∈ {0, 1}^k and let y = En(m) + η. We have:
Pr[Dn(y) = m] ≥ 1/2    (6)
Let Sm = {y | Dn(y) = m}; that is, Sm is the set of received words that are decoded to m.
Assume En(m) was transmitted, and consider a “doughnut” S of radii [(p − ε)n, (p + ε)n] around En(m),
i.e. the set of words whose distance from En(m) lies in [(p − ε)n, (p + ε)n]. By the Chernoff bound, we have:
Pr[y ∉ S] ≤ 2^{−Ω(ε²n)}
From the above equation and (6) we have:
Pr[Dn(y) = m | y ∈ S] ≥ 1/2 − 2^{−Ω(ε²n)}
or
|Sm ∩ S| ≥ (1/2 − 2^{−Ω(ε²n)}) · |S|
For large enough n we have:
|Sm ∩ S| ≥ (1/4)|S|    (7)
From the volume bound we also have:
|S| ≥ (n choose pn) ≥ 2^{H(p)n−o(n)}    (8)
Now we have that for any m1 ≠ m2, Sm1 and Sm2 are disjoint, hence:
2^n = Σ_{m ∈ {0,1}^k} |Sm|
  ≥ Σ_{m ∈ {0,1}^k} |Sm ∩ S|
  ≥ (1/4) Σ_{m ∈ {0,1}^k} 2^{H(p)n−o(n)}
  = 2^{k−2} · 2^{H(p)n−o(n)}    (9)
The second inequality follows from (7) and (8). Now, (9) implies that k ≤ (1 − H(p))n + o(n), or that
R ≤ 1 − H(p) + o(1), which is a contradiction.
Theorem 2.1 tells us that there is a real number characteristic of the BSC (its capacity) such that communication
over this channel is possible if the rate is less than this number, and if the rate is greater than this number,
communication is not possible. Shannon showed that using random codes we can achieve positive values for
both rate and relative distance, and provided characterizations of “feasible” and “infeasible” codes, even though
he did not provide a constructive proof. Furthermore, the maximum likelihood decoding used in the proof takes
exponential time. A question that emerges from this is whether we can come up with an explicit construction
of a code of rate 1 − H(p + ε) with efficient encoding and decoding, so as to achieve reliable communication.
The fundamental incompatibility between the rate R = k/n of a code and its relative distance δ = d/n had been
observed by Hamming as well, in a different model. Hamming’s theory provides explicit constructions of codes
and does not model the errors probabilistically, but rather accounts for worst-case errors. In the next section we
study the relationship between the rate R = k/n and the relative distance δ = d/n of codes and study some upper
and lower bounds that such relationships impose on the rate.
3 Gilbert-Varshamov Bounds
3.1 The Gilbert bound
Theorem 3.1. For every 0 ≤ δ ≤ 1/2 there exists a family of codes of rate R and relative distance δ such
that:
R ≥ 1 − H(δ) − o(1)
3.2 The Varshamov bound
Theorem 3.2. ∀ 0 ≤ δ ≤ 1/2, ∃ an [n, k, d (≥ δn)]_2 code such that:
R ≥ 1 − H(δ) − o(1)
Proof. Pick a random parity-check matrix H ∈ {0, 1}^{(n−k)×n}. Define the code:
C = {y ∈ {0, 1}^n | Hy = 0}
Since C is linear, d(C) = min_{y ∈ C, y ≠ 0} wt(y), so we are done if we prove that every nonzero codeword has
weight at least δn. We use the probabilistic method to prove the following: if k ≤ n(1 − H(δ)), then wt(y) ≥ δn
for all nonzero y ∈ C, with high probability over the choice of H.
We want to show that the following quantity is less than 1:
Pr[∃ y ≠ 0 with wt(y) < δn such that Hy = 0]
Fix a nonzero y ∈ {0, 1}^n. If hi is the i-th row of H, then
Pr[⟨hi, y⟩ = 0] = 1/2
and hence
Pr[∀ i = 1, . . . , n − k, ⟨hi, y⟩ = 0] = (1/2)^{n−k}
Taking a union bound over all nonzero y of weight less than δn:
Pr[∃ y ≠ 0, wt(y) < δn such that Hy = 0] ≤ |{y | wt(y) < δn}| · (1/2)^{n−k}
  = |B_{δn}(0)| · (1/2)^{n−k}
  ≤ 2^{H(δ)n−(n−k)}
where the last inequality is by Lemma 3.3 from Lecture 2.
We want the above probability to be less than 1, which holds whenever H(δ)n − (n − k) < 0, i.e. whenever
k < n − H(δ)n. In particular, we can take
k = n − H(δ)n − o(n)
or
R ≥ 1 − H(δ) − o(1)
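As a small sanity check of the random parity-check construction, the following Python sketch (toy parameters chosen by the editor for speed; the guarantee holds only with high probability over H, so an occasional run may fall short of δn) samples H, enumerates the null-space code C = {y : Hy = 0}, and reports its minimum distance next to the design distance δn.

import itertools
import random
from math import log2

def H2(x):
    """Binary entropy function H(x)."""
    return -x * log2(x) - (1 - x) * log2(1 - x)

def min_distance(n, H):
    """Minimum weight of a nonzero codeword in C = {y in {0,1}^n : Hy = 0 over GF(2)}."""
    best = n + 1
    for y in itertools.product((0, 1), repeat=n):
        if not any(y):
            continue                                        # skip the all-zero word
        if all(sum(h * b for h, b in zip(row, y)) % 2 == 0 for row in H):
            best = min(best, sum(y))
    return best

if __name__ == "__main__":
    n, delta = 14, 0.25
    k = int(n * (1 - H2(delta)))                            # k just below n(1 - H(delta))
    H = [[random.randint(0, 1) for _ in range(n)] for _ in range(n - k)]
    print(f"n={n}, k={k}, target weight >= {delta * n:.1f}, min distance = {min_distance(n, H)}")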
4 Negative results
4.1 Hamming bound
Theorem 4.1. For any binary code:
R ≤ 1 − H(δ/2) + o(1)
4.2 Singleton bound (Projection bound)
Theorem 4.2. For any code (not necessarily binary):
R ≤ 1 − δ + o(1)
Proof. Consider an (n, k, d)_q code C. We want to show that k ≤ n − d + 1, that is:
|C| ≤ q^{n−d+1}
Define a projection map f : C → Σ^{n−d+1} such that for c = (c1, . . . , cn), f(c1, c2, . . . , cn) =
(c1, c2, . . . , c_{n−d+1}) = c′.
Proposition 4.3. f is injective.
Proof. Consider two codewords c1 ≠ c2. If f(c1) = f(c2), then c1 and c2 agree on their first n − d + 1 positions
and can therefore differ in at most d − 1 positions; but since the distance of the code is d, they must differ in at
least d positions. Thus f(c1) ≠ f(c2).
Since f is injective, the number of codewords satisfies |C| ≤ q^{n−d+1}.
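A quick illustration of the projection argument (a Python snippet on a small hand-picked code chosen by the editor; any code would do): compute d, project each codeword to its first n − d + 1 coordinates, and check that no two projections collide.

from itertools import combinations

def distance(code):
    """Minimum Hamming distance over all pairs of distinct codewords."""
    return min(sum(a != b for a, b in zip(u, v)) for u, v in combinations(code, 2))

if __name__ == "__main__":
    # A small hand-picked binary code with n = 4, used only as an example.
    code = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0), (1, 1, 1, 1)]
    n, d = len(code[0]), distance(code)
    projections = [c[: n - d + 1] for c in code]
    assert len(set(projections)) == len(code), "projection must be injective"
    print(f"n={n}, d={d}, |C|={len(code)} <= q^(n-d+1) = {2 ** (n - d + 1)}")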
4.3 Plotkin bound
Theorem 4.4. For any [n, k, d]_2 code:
1. If d = n/2, then k ≤ 1 + log n.
2. If d ≥ (1 + ε)·n/2, then k ≤ log(1 + 1/ε).
3. R ≤ 1 − 2δ + o(1).
Proof of (1) ⟹ (3). Let C be a code as given. We use the same projection strategy as in the proof of
Theorem 4.2. Consider the map f : C → {0, 1}^{n−2d+1} such that for c = (c1, . . . , cn), f(c1, c2, . . . , cn) =
(c1, c2, . . . , c_{n−2d+1}).
For a given x ∈ {0, 1}^{n−2d+1}, define
Cx = {(c_{n−2d+2}, c_{n−2d+3}, . . . , cn) ∈ {0, 1}^{2d−1} : f(c1, c2, . . . , cn) = x}
That is, Cx is the set of all suffixes of length 2d − 1 of codewords with prefix x. Since the elements of Cx
correspond to codewords in C sharing the same prefix x of length n − 2d + 1, and every pair of codewords in C
is at distance at least d, any two distinct elements of Cx differ in at least d of the last 2d − 1 coordinates; hence
d(Cx) ≥ d. Cx is therefore a code of block length n′ = 2d − 1 and distance at least d, for every x ∈ {0, 1}^{n−2d+1}.
By averaging, there is some x for which:
|Cx| ≥ |C| / (number of prefixes) = |C| / 2^{n−2d+1}
Now, applying part (1) of the theorem to the code Cx, we have:
|Cx| ≤ 2^{1+log n′} ≤ 4d
So we have
|C| ≤ 2^{n−2d+1} · 4d
⟹ k ≤ n − 2d + 1 + log(4d) = n − 2d + o(n)
⟹ R ≤ 1 − 2δ + o(1)
For parts (1) and (2) we’ll use the following lemma.
Lemma 4.5. Let v1, v2, . . . , vm be unit vectors in R^n.
1. If ⟨vi, vj⟩ ≤ 0 for all i ≠ j, then m < 1 + 2n, and
2. If ∃ ε > 0 such that ⟨vi, vj⟩ ≤ −ε for all i ≠ j, then m ≤ 1 + 1/ε.
Proof. We will prove the second part of the lemma here. Note that
0 ≤ ‖v1 + v2 + · · · + vm‖² = Σ_i ‖vi‖² + 2 Σ_{i<j} ⟨vi, vj⟩ ≤ m + 2 · (m(m−1)/2) · (−ε) = m − m(m−1)ε
Hence 0 ≤ m − m(m−1)ε, which gives
m ≤ 1 + 1/ε
Proof of parts (1) and (2). We will embed codewords as unit vectors in R^n. Consider the map f : {0, 1}^n → R^n
defined as:
f(x1, x2, . . . , xn) := (1/√n)·((−1)^{x1}, (−1)^{x2}, . . . , (−1)^{xn}) ∈ R^n
Notice that for c, c′ ∈ C:
⟨f(c), f(c′)⟩ = 1 − 2∆(c, c′)/n ≤ 1 − 2δ
This is because if the i-th bits of c and c′ are the same then the i-th term in the expansion of ⟨f(c), f(c′)⟩ is
1/n, otherwise it is −1/n.
If δ ≥ 1/2 + ε/2, then
⟨f(c), f(c′)⟩ ≤ 1 − 2(1/2 + ε/2) = −ε
Let d = n/2. Then for all c ≠ c′, ⟨f(c), f(c′)⟩ ≤ 0. Since f is injective, from part (1) of Lemma 4.5 we have:
|C| ≤ 2n
⟹ k ≤ 1 + log n
which proves part (1).
Assume d ≥ (1 + ε)·n/2. Then,
⟨f(c), f(c′)⟩ ≤ −ε
From part (2) of Lemma 4.5 we have
|C| ≤ 1 + 1/ε
⟹ k ≤ log(1 + 1/ε)
which proves part (2).
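As a quick numerical check of the embedding identity ⟨f(c), f(c′)⟩ = 1 − 2∆(c, c′)/n used above, here is a small Python snippet with random strings (purely illustrative; the block length is an arbitrary choice).

import random
from math import sqrt, isclose

def embed(x):
    """Map a bit string x in {0,1}^n to the unit vector (1/sqrt(n)) * ((-1)^x1, ..., (-1)^xn)."""
    n = len(x)
    return [(-1) ** b / sqrt(n) for b in x]

if __name__ == "__main__":
    n = 32
    c = [random.randint(0, 1) for _ in range(n)]
    c2 = [random.randint(0, 1) for _ in range(n)]
    inner = sum(a * b for a, b in zip(embed(c), embed(c2)))
    dist = sum(a != b for a, b in zip(c, c2))            # Hamming distance Delta(c, c')
    assert isclose(inner, 1 - 2 * dist / n, abs_tol=1e-9)
    print(f"<f(c), f(c')> = {inner:.4f} = 1 - 2*{dist}/{n}")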
4.4 Johnson bound
Let J(n, d, e) be the maximum number of codewords in a ball of Hamming radius e, over all binary codes of
block length n and distance d.
Note: The subscript q will denote a q-ary code when relevant.
Definition 4.6.
Jq(δ) = (1 − 1/q)·(1 − √(1 − (q/(q−1))·δ))
and
J(δ) = (1 − √(1 − 2δ)) / 2
Theorem 4.7. If e/n < (1 − √(1 − 2δ)) / 2, then J(n, d, e) ≤ 2nd.