
Responses: Coding Theory—From Books and Barcodes to
the Cosmos
Coding theory is a subject of immense importance to just about every walk of
modern life. The discussion below gives a snapshot of a few important methods
but any such summary can only scratch the surface. I hope you find time to look
further into the maths behind some of these and other uses of information theory
as it provides a really useful introduction to many topics in higher mathematics:
number theory, vector spaces, abstract algebra and more. Most of the apparatus
we use here would, even 50 years ago, have been placed firmly in the camp of pure
rather than applied mathematics, but the ubiquitous nature of the applications of
coding shows how maths continually finds uses in unexpected and exciting ways.
What is a Code?
We’re all familiar with the need to communicate information securely, in order that only the intended recipient can receive a given message. This discipline
is known as cryptography, from the Ancient Greek meaning “hidden writing”.
What is much more commonplace, however, and arguably more important, is the
ability to transmit information accurately, whether it is secret or, as in most cases,
not. A code is simply a means of concisely expressing information in a way that
enables economical and accurate transmission depending on the requirements of
the job.
In a very real sense, therefore, we are using an extremely well-known and successful code to communicate right now. It uses the 26 letters of the Roman
alphabet, and it’s called the English Language! Not only is this a very efficient
way to encode information (imagine this paragraph in hieroglyphics), it also enables accurate transmission of a message. The rules of spelling and grammar
place restrictions on the order and form of allowable words, and the evolution
of English over the last 1500 or so years has, like all widely-spoken languages,
produced a system which not only detects but corrects errors in transmission.
Take an extreme example like
TO BG OR NWT TO BE TKAT IS VHE QRESTJON
It doesn’t take too long to figure out what’s meant here, even with 6 out of 30
characters received incorrectly. So it looks like this code can handle a whopping
20% error rate and still enable successful decoding. Impressive!
Figure 1: Early prototype of a highly effective code.
There’s a sting in the tail, though. In order to develop such fast and powerful
“error-correction software”, our brains have had to spend most of our childhood
years hardwiring the many intricate rules and regulations of the code. In the
majority of instances (say reading a CD or scanning a barcode) the transmission,
reception and possible correction of data has to be achieved automatically by
something a little less clever (but much faster) than we are. And if we’re going to
use our mathematical toolkit to design these procedures, it makes sense to work
with numbers rather than letters.
What is an Error-Detecting Code?
This is best illustrated with a familiar example. Take any book published between
1970 and 2006 (the protocol changed slightly in 2007 owing to the increasing number of books in existence but the principle remains the same) and on the back
will be a 10-digit ISBN identification number of the form
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 ,
which is scanned at the bookshop till. Incorrect reading of this number results in
the customer being charged the price of a different book and out-of-kilter stock
control. This is avoided by using the first 9 digits to identify the book and taking
the tenth digit (the check digit) so that
10a1 + 9a2 + 8a3 + 7a4 + 6a5 + 5a6 + 4a7 + 3a8 + 2a9 + a10 ≡ 0 (mod 11) (1)
This is just another way of saying that the left-hand side must be an exact multiple
of 11, since “mod” equates to “remainder on division by”. Note that in some cases
the check digit must take the value 10, and X is used for this purpose. So, the
first task is to check that the ISBN of the nearest book to hand satisfies condition
(1). Next, see what happens if you alter any one of the digits in isolation. This
should cause the condition to fail whatever the error. Now try swapping two
adjacent digits. Again this should cause the ISBN to be interpreted as invalid.
These two errors are far and away the most common in this particular instance.
Incidentally, at this stage it is worthwhile to ask why 11 is chosen as the modulus
on the right-hand side. What makes this a better choice than, say, 10 or 12?
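If you have Python to hand, here's a minimal sketch of condition (1) you can experiment with; the function name and the sample numbers are my own choices for illustration, not part of any standard library:

```python
def isbn10_valid(isbn):
    """Check condition (1): the weighted sum 10*a1 + 9*a2 + ... + 1*a10
    must be divisible by 11.  'X' stands for a check digit of 10."""
    digits = [10 if ch == "X" else int(ch) for ch in isbn if ch != "-"]
    if len(digits) != 10:
        return False
    weighted_sum = sum(weight * digit
                       for weight, digit in zip(range(10, 0, -1), digits))
    return weighted_sum % 11 == 0

# A valid ISBN fails the check after a single-digit error or adjacent swap:
print(isbn10_valid("0306406152"))   # True  (a genuine ISBN-10)
print(isbn10_valid("0306406155"))   # False (last digit altered)
print(isbn10_valid("0360406152"))   # False (third and fourth digits swapped)
```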
Figure 2: Error detection in practice.
There is nevertheless a limit to the error detection with this scheme. It is possible
to change two separated digits so that (1) is still satisfied, causing the spurious
ISBN to slip through the net undetected (though for a 1% chance of error for
any particular digit, the likelihood of this is less than 4 in 10000). Since two or
more errors can pass the check, but a single error is always highlighted, the code
is said to be 1-error detecting. Note, however, that there is no correcting capability here—the code flags any single error, but cannot automatically evaluate
which digit is incorrect. This is rarely a problem, since ultimately a bookseller
can contact the publisher to address the issue, and indeed it may be argued that
this is actually better in this case than a system which could “guess” the correct
sequence, but possibly guess wrong (shipping crates of books costs more than a
phone call!)
In some cases (laser scanning of CDs, pictures beamed from a space probe) we
do not have this option. In circumstances such as these, not only is it desirable
to detect errors, we must fashion a way to work out what the correct data should
have been. We’ll come to this shortly.
Some General Considerations
What should a corrupted codeword really be? In an earlier example, we decided
the erroneously-received QRESTJON should in fact have been QUESTION. Why not
BEETROOT, TOBOGGAN or AARDVARK? Given a small (≪ 0.5) probability of any particular letter being corrupted, the intended codeword is far more likely to be our
initial guess, with just 2 corrections, than the other suggestions, which require
6, 7 and 8 respectively. We can never be absolutely sure, but this is certainly a
sensible strategy and illustrates the principle of maximum likelihood decoding—if a received data string is not an admissible codeword, then the convention
is to assume the intended word is that which is “nearest” to the received data.
Next we have to define precisely what this means.
From here on, let's assume our codewords take the form of a string of ones
and zeros. This is ultimately necessary for any form of digital transmission,
and besides there exist well-established conventions such as ASCII for turning
alphanumeric characters into 8-bit binary strings. Imagine a couple of codewords
c1 = 10110111, c2 = 11100101.
We define the Hamming distance d(c1 , c2 ) to be the number of places in which
two codewords have differing digits. Hence in this case
d(c1 , c2 ) = 3,
as c1 and c2 differ in their second, fourth and seventh elements. Continuing
with this form of 8-bit codewords, it would be perfectly feasible to allow every
possible binary string, which gives us 2^8 = 256 different codeword possibilities.
The problem is, if there are any errors whatsoever, these must remain undetected
since a wrongly received string will nevertheless be a bona fide codeword in its
own right. So the trick is to somehow restrict the allowable strings to a smaller
subset, and use the convention of maximum likelihood decoding to interpret an
erroneously-received string as the codeword which is the smallest Hamming distance away.
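Both ideas are easy to experiment with; here's a short sketch (the function names are my own):

```python
def hamming_distance(c1, c2):
    """Number of places in which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(c1, c2))

def nearest_codeword(received, codebook):
    """Maximum likelihood decoding: the codeword closest to the
    received string in Hamming distance."""
    return min(codebook, key=lambda c: hamming_distance(received, c))

print(hamming_distance("10110111", "11100101"))                 # 3
print(nearest_codeword("10100111", ["10110111", "11100101"]))   # 10110111
```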
We are now in a position to define the two central properties of a particular
code C:
• We say a code is m error detecting if changing up to m digits in one
codeword never produces another; that is, d(ci, cj) > m for all distinct codewords ci, cj.
Figure 3: A 5-bit codeword and its cloud of strings with Hamming distance 1.
• We say a code is n error correcting if knowing that a received string
differs in at most n places from a codeword of C means that we can deduce
the codeword. So provided no received word has more than n errors, we
are guaranteed to reconstruct the entire message perfectly.
We’ve already seen that ISBN is 1 error detecting and 0 error correcting (albeit
in decimal not binary). At the other extreme is the repetition code. In order
to guard against channels with a high error rate, this works by taking each bit
of the original message data and transmitting repeatedly a designated number of
times to greatly reduce the chance of error. So for an 8-bit repetition code, say,
the only admissible codewords are
00000000
and
11111111.
This code is 7 error detecting (only if all 8 bits are corrupted will it slip through
the net) and 3 error correcting (eg 00010010 will be decoded as 00000000. Only
with 4 or more errors is there a possibility of a wrong decode.) Depending on
how noisy — that is, prone to errors — our communication channel is, we may
wish to use longer strings still to make the likelihood of wrong decoding even
smaller. Using the binomial distribution, you can investigate the chances of more
than half the digits being in error for a sequence of given length and bit error rate.
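Here's a small sketch to get you started (the function name is mine; for even lengths, a tie of exactly half the bits being wrong is ignored for simplicity):

```python
from math import comb

def p_majority_wrong(n, p):
    """Chance that more than half of the n repeated bits are flipped,
    given bit error rate p, so majority-vote decoding fails."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Longer repetition codes on a noisy channel (bit error rate 10%):
for n in (3, 8, 15):
    print(n, p_majority_wrong(n, 0.1))
```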
The flip side of this increased confidence, of course, is a big reduction in efficiency. To get our message across in the above case requires sending 8 times
the actual information contained. In effect each 1 or 0 in the original message is
followed by seven (hopefully identical) check digits. This code is said to have information rate (IR) equal to 1/8, as this is the proportion of “useful message”
in the transmission, with the rest being used to improve accuracy. As for those
schemes alluded to earlier, the trivial code in which all 8-bit strings are allowed
has information rate 1 but no error detection or correction capability. ISBN has
IR=9/10. The binary equivalent of this is the parity check code, in which we
break the message data into strings of fixed length n−1, say, and append a parity
check digit an so that the sum of the bits is even:
a1 + a2 + · · · + an−1 + an ≡ 0 (mod 2).
Here, IR=(n − 1)/n.
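As a quick illustration of the parity check code (the names are my own):

```python
def add_parity_bit(bits):
    """Append a check digit so that the bit sum is even (mod 2)."""
    return bits + str(sum(int(b) for b in bits) % 2)

def parity_ok(word):
    """Any single flipped bit makes the sum odd, so it is detected."""
    return sum(int(b) for b in word) % 2 == 0

word = add_parity_bit("1011010")   # '10110100': four 1s, even sum
print(parity_ok(word))             # True
print(parity_ok("00110100"))       # False: first bit corrupted
```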
So there is a tradeoff between IR and confidence in accurate decoding. Most
practical codes fall somewhere between these two extremes, and the trick lies
in selecting that which best suits the particular task. Let’s now look at some
examples.
Hamming’s Code
This was the first error correcting code, used to input programs via paper tape
(ask your Dad!) in the pioneering days of electronic computers from the 1950s
onwards. Here, our allowable codewords are 7 digits long and take the form
c = c1 c2 c3 c4 c5 c6 c7
in which the digits obey the 3 restrictions

c1 + c3 + c5 + c7 = 0
c2 + c3 + c6 + c7 = 0        (2)
c4 + c5 + c6 + c7 = 0

working mod 2. A glance at these equations shows that c3, c5, c6 and c7 may be
chosen arbitrarily to be 0 or 1, but the other (check) digits c1, c2, c4 are then
fixed. This means that only 16 of the 2^7 = 128 possible strings are codewords
and the information rate is 4/7. The relations (2) can be expressed in matrix
form as

$$
\begin{pmatrix}
1 & 0 & 1 & 0 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 1 & 1 & 1
\end{pmatrix}
\begin{pmatrix}
c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \\ c_6 \\ c_7
\end{pmatrix}
=
\begin{pmatrix}
0 \\ 0 \\ 0
\end{pmatrix}. \qquad (3)
$$
Note that this 3 × 7 matrix is composed of the binary representations of 1, 2, 3,
4, 5, 6 and 7 if read upward. Error detection is easy. Suppose we receive a data
string
x = x1 x2 x3 x4 x5 x6 x7 .
Figure 4: Richard Hamming, coding pioneer. The jacket was an
error that should have been corrected.
Referring to the matrix above as H, to check if x is an admissible codeword
merely requires us to see if

$$
Hx = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}. \qquad (4)
$$
The error correction is the cunning bit. If we have a data string
x = 1001010
then

$$
Hx = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix},
$$
indicating that x is invalid. But reading the right-hand side upwards gives 3 in
binary. The correction algorithm then instructs us to alter the third digit of x:
1001010 −→ 1011010. Doing so gives a genuine codeword since condition (4) is
now satisfied. This is no fluke — try it for yourself by selecting a random 7-bit
string and seeing if (4) is obeyed. If not, the RHS vector tells us (reading upwards
in binary) which element of the string to correct.
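Here's a sketch of the whole procedure, using the matrix H above (the helper names are my own):

```python
# Rows of the parity check matrix H from (3); all arithmetic is mod 2.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def syndrome(x):
    """Hx mod 2 for a 7-bit list x."""
    return [sum(h * b for h, b in zip(row, x)) % 2 for row in H]

def correct(x):
    """A nonzero syndrome, read upward in binary (rows weighted 1, 2, 4),
    names the single position to flip."""
    s = syndrome(x)
    pos = s[0] + 2 * s[1] + 4 * s[2]
    if pos:
        x = x.copy()
        x[pos - 1] ^= 1
    return x

x = [1, 0, 0, 1, 0, 1, 0]      # the invalid string from the text
print(syndrome(x))              # [1, 1, 0], naming position 3
print(correct(x))               # [1, 0, 1, 1, 0, 1, 0], a valid codeword
```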
See if you can find all 16 codewords by different choices of c3 , c5 , c6 and c7 .
You ought then to be able to verify that the minimum separation (Hamming
distance) of any two is 3. Since no two codewords are closer than this,
Hamming's code is 2 error detecting and 1 error correcting. Generally the degree
of error correction of a code is half that of error detection, rounded down. Have
a think why this is so. It helps to picture spherical clouds around each codeword
containing strings a certain distance away, and to consider how big these clouds
can be without overlap.
There’s another attractive feature of Hamming’s code. Each codeword c has
seven strings just 1-distant, obtained by changing each digit in turn (as in figure
3). These strings are inadmissible and would resolve to c under the correction
algorithm. There’s no overlap of each codeword cloud, which contains 8 strings (c
and its seven near neighbours), since the minimum distance between codewords
is 3. So all clouds together collectively contain 8 × 16 = 128 members, which
covers every possible 7-digit string. Such a code is called perfect. Every single
received data string is either a legitimate codeword or a single alteration away
from one.
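If you'd like to check all of this by machine rather than by hand, here's one way (a sketch, with my own function names):

```python
from itertools import product

def encode(c3, c5, c6, c7):
    """Solve the three restrictions (2) mod 2 for the check digits."""
    c1 = (c3 + c5 + c7) % 2
    c2 = (c3 + c6 + c7) % 2
    c4 = (c5 + c6 + c7) % 2
    return (c1, c2, c3, c4, c5, c6, c7)

codebook = [encode(*bits) for bits in product((0, 1), repeat=4)]

def d(u, v):
    return sum(a != b for a, b in zip(u, v))

print(len(codebook))                        # 16 codewords
print(min(d(u, v) for u in codebook
          for v in codebook if u != v))     # minimum distance 3

# Each codeword with its 7 one-bit neighbours forms a "cloud"; the 16
# clouds tile all 2**7 = 128 strings, so the code is perfect.
cloud = set()
for w in codebook:
    cloud.add(w)
    for i in range(7):
        cloud.add(w[:i] + (1 - w[i],) + w[i + 1:])
print(len(cloud))                           # 128
```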
Hamming’s code can be generalised. For instance we can design a 15-bit code
with a 4×15 parity check matrix H containing each of 1, 2,. . . ,15 written upwards
in binary. Going the other way, a 3-bit code with a 2 × 3 check matrix generates
a kind of simple code we’ve already discussed (which one?) In general there exist
Hamming codes of length 2^N − 1 (N ≥ 2), each 1 error correcting and having a
minimum distance of 3. Have a think about the IR and number of codewords in
each case.
Hamming codes are tailor-made for situations in which very long sequences of
binary are to be communicated but the chance of any individual bit error is very
small, so that it is extremely unlikely any given string possesses 2 errors and is
thus decoded wrongly. For the next example, in contrast, we need a scheme with
more robust correction properties.
Space: The Final Frontier
Consider a space probe which is launched deep into the solar system to relay
information about distant planets and other celestial bodies. The difficulty of
sending data back to earth over several billion miles has been colourfully compared to signalling across the Atlantic with a child's torch! NASA have to expect
a high error rate on receipt of such a weak signal, but also the code must be
moderately economical (unlike a repetition code) so as to ensure we get as much
mileage as possible out of the nuclear power source on board.
The technology devised for this is a Reed–Muller code, and here’s essentially
what goes on. First pick a positive integer N. Then we define N vectors of length
2^N with 1s and 0s in blocks of 1, 2, 4, . . . , 2^{N−1}. So for instance with N = 4 these
vectors are
h1 = 1111111100000000
h2 = 1111000011110000
h3 = 1100110011001100
h4 = 1010101010101010
along with an additional vector containing only 1s:
h0 = 1111111111111111.
Additionally we need to define an overlap operator ∧, rather like an intersection
in set theory, which returns 1 only if both vectors have 1 in a particular place, so
for example
h1 ∧ h2 = 1111000000000000
h1 ∧ h3 = 1100110000000000
h2 ∧ h4 = 1010000010100000
This operation can be repeated. For instance
h1 ∧ h2 ∧ h4 = 1010000000000000.
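In Python the overlap operator is just a digit-by-digit AND; a quick sketch (the name overlap is mine):

```python
def overlap(u, v):
    """The wedge operator: 1 exactly where both strings have a 1."""
    return "".join("1" if a == b == "1" else "0" for a, b in zip(u, v))

h1 = "1111111100000000"
h2 = "1111000011110000"
h4 = "1010101010101010"
print(overlap(h1, h2))               # 1111000000000000
print(overlap(overlap(h1, h2), h4))  # 1010000000000000
```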
Finally we introduce the idea of a generating set. This is a subset of codewords
which when combined by addition (remembering that in binary 1 + 1 = 0) give
us every codeword. To illustrate this, recall that the 4-bit parity check code
(codewords with an even number of 1s) consists of the elements
0000 0011 0101 0110
1001 1010 1100 1111
but you should satisfy yourself that any one of these can be arrived at by some
combination of
0011, 0101 and 1001

(the null codeword 0000 can be constructed by adding any element to itself).
These 3 elements are a generating set for the code.

Figure 5: Voyager 1 — 10 billion miles and still transmitting.
Now we have all the machinery we need. The Reed–Muller RM(4,1) code is
the set of all codewords generated by h1 , h2 , h3 , h4 and h0 . These 5 elements
generate a codebook of 32 (= 2^5) elements and the IR is 5/16 — pretty healthy.
But where Reed–Muller comes into its own is with error-prone data. The minimum
distance between two codewords turns out to be 2^{4−1} = 8 (try it for two
elements, eg d(h1, h3) = 8). So RM(4,1) is 7 error detecting and 3 error correcting.
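You can generate the whole codebook and confirm these numbers with a short program (a sketch; add and span are my own names):

```python
from itertools import product

h0 = "1111111111111111"
h1 = "1111111100000000"
h2 = "1111000011110000"
h3 = "1100110011001100"
h4 = "1010101010101010"

def add(u, v):
    """Digit-by-digit binary addition (1 + 1 = 0)."""
    return "".join(str((int(a) + int(b)) % 2) for a, b in zip(u, v))

def span(generators):
    """Every codeword reachable by adding a subset of the generating set."""
    words = set()
    for mask in product((0, 1), repeat=len(generators)):
        w = "0" * len(generators[0])
        for take, g in zip(mask, generators):
            if take:
                w = add(w, g)
        words.add(w)
    return words

def d(u, v):
    return sum(a != b for a, b in zip(u, v))

rm41 = span([h0, h1, h2, h3, h4])
print(len(rm41))                                          # 32 = 2**5
print(min(d(u, v) for u in rm41 for v in rm41 if u != v)) # 8
```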
The second order code RM(4,2) is generated by the generating set of RM(4,1)
(h0 , h1 , h2 , h3 , h4 ) together with the six elements
hi ∧ hj    (i, j = 1, 2, 3, 4, i ≠ j).
Here an IR of 11/16 results, with 3 errors detected and 1 corrected. Going further,
the RM(4,3) code also includes terms of the form
hi ∧ hj ∧ hk
in the generating set. Finally with RM(4,4), a last generating element
h1 ∧ h2 ∧ h3 ∧ h4 = 1000000000000000
is included, and every possible 16-bit string can be manufactured by combining
the 16 independent elements of the generating set. This is the familiar trivial
code with IR=1 but no correction properties.
You could try and work out similar parameters for other codes of the form
RM(d, r), where d ≥ r; there's a short computational sketch after this list. Some
findings you might like to check:
• The minimum Hamming distance of RM(d, r) is always 2^{d−r}, from which
the correction capabilities follow
• RM(d, 0), generated by h0 alone, is the repetition code of length 2^d
• RM(d, d − 1) is a parity check code which thus detects 1 error but cannot
correct
• The best payoffs between IR and correction power lie between these extremes
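Here is that sketch. It assumes the counting of generators used above (h0, plus one wedge for each subset of up to r of the hi); the function name is my own:

```python
from math import comb

def rm_parameters(d, r):
    """Dimension k, information rate and minimum distance of RM(d, r):
    one generator for h0 plus one for each wedge of up to r of the hi,
    and minimum distance 2**(d - r)."""
    k = sum(comb(d, i) for i in range(r + 1))
    return k, k / 2**d, 2**(d - r)

for r in range(5):
    k, ir, dist = rm_parameters(4, r)
    print(f"RM(4,{r}): {2**k} codewords, IR = {k}/16, distance {dist}")
```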
If this machinery seems complicated to you then take heart because you’re not
alone! It wasn’t discovered until just before the first satellites were sent into space
and now crops up in third-year undergraduate maths courses. The Voyager probes
launched in 1977 use the RM(5,1) code, and to this day continue to deliver information as they exit the solar system some 10 billion miles away.

Figure 6: This image of lightning storms in Jupiter's atmosphere came back
courtesy of RM(5,1).
The decoding procedure (automatically finding which legitimate codeword is closest to a corrupted received string) is a little trickier to implement than for the
Hamming code. Note, though, that it is the encoding that must be simple, straightforward and economical. Once the precious data has reached earth we can throw
the kitchen sink at decoding and take as much time as we wish, in principle.
Back to Earth With a Bang
When sending data from Voyager 1 to Houston, the code must be simple and
robust but the decoder can be as complex as required. On the other hand, the
encoding system which produces compact discs (in a Sony factory, say) can be
very expensive, whereas the decoder (the CD player) must be cheaply mass-produced. Data on an audio CD is read at about 10 MB per minute, and high
sound quality is only achieved if any errors are corrected in real time as the contents are read. In summary then, the decoding algorithm has to be simple and
fast.
CDs employ a special type of code called a cyclic code. Quite simply this
means that for an n-digit code if
a1 a2 a3 . . . an
Figure 7: Five billion ones and zeros.
is a codeword then
an a1 a2 . . . an−1
is also a codeword, so by iteration we can “wrap around” an arbitrary number of
places. We’ve seen an example if this structure with the parity check code. For
the 4-bit version,
1100 −→ 0110 −→ 0011 −→ 1001
are all elements of the codebook.
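A short sketch makes the "wrap around" property easy to test (the function names are mine):

```python
from itertools import product

def cyclic_shifts(word):
    """All rotations of a codeword; in a cyclic code every rotation
    must itself be a codeword."""
    return {word[i:] + word[:i] for i in range(len(word))}

def is_cyclic(codebook):
    return all(cyclic_shifts(w) <= codebook for w in codebook)

# The 4-bit parity check code (even number of 1s) is cyclic:
parity4 = {"".join(bits) for bits in product("01", repeat=4)
           if bits.count("1") % 2 == 0}
print(sorted(cyclic_shifts("1100")))   # ['0011', '0110', '1001', '1100']
print(is_cyclic(parity4))              # True
```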
Certain types of cyclic codes are very useful for correcting “burst” errors. This
refers to a connected section of data being erroneous. Defects of this kind are
obviously common on CDs because of scratches. A particular variant, known as
a BCH code, is designed to be able to have an arbitrarily large minimum distance
between words, and is ideal for dealing with such problems. Using this, modern
CD readers are able to correct in real time a burst of 4000 consecutive errors,
which corresponds to about 2.5mm of track. It’s an interesting experiment to
take a CD (preferably no longer one of your favourites) and see how much it is
possible to black out with a permanent marker before your hi-fi protests!
Having scratched the surface (no pun intended) with a few applications, it is evident how
coding theory pervades pretty much every aspect of life involving information
or electronics. So I hope you’ll be motivated to go ahead and use the resources
at your disposal to delve further into what is a vital, developing area of maths.
And when you do — whether it’s buying a book on the subject, researching by
downloading webpages, or using your mobile phone to discuss it with a friend —
it will all be possible because of error-correcting codes.