Responses: Coding Theory—From Books and Barcodes to the Cosmos

Coding theory is a subject of immense importance to just about every walk of modern life. The discussion below gives a snapshot of a few important methods, but any such summary can only scratch the surface. I hope you find time to look further into the maths behind some of these and other uses of information theory, as it provides a really useful introduction to many topics in higher mathematics: number theory, vector spaces, abstract algebra and more. Most of the apparatus we use here would, even 50 years ago, have been placed firmly in the camp of pure rather than applied mathematics, but the ubiquitous nature of the applications of coding shows how maths continually finds uses in unexpected and exciting ways.

What is a Code?

We're all familiar with the need to communicate information securely, in order that only the intended recipient can receive a given message. This discipline is known as cryptography, from the Ancient Greek meaning "hidden writing". What is much more commonplace, however, and arguably more important, is the ability to transmit information accurately, whether it is secret or, as in most cases, not. A code is simply a means of concisely expressing information in a way that enables economical and accurate transmission, depending on the requirements of the job.

In a very real sense, therefore, we are using an extremely well-known and successful code to communicate right now. It uses the 26 letters of the Roman alphabet, and it's called the English Language! Not only is this a very efficient way to encode information (imagine this paragraph in hieroglyphics), it also enables accurate transmission of a message. The rules of spelling and grammar place restrictions on the order and form of allowable words, and the evolution of English over the last 1500 or so years has, like all widely-spoken languages, produced a system which not only detects but corrects errors in transmission. Take an extreme example like

    TO BG OR NWT TO BE TKAT IS VHE QRESTJON

It doesn't take too long to figure out what's meant here, even with 6 out of 30 characters received incorrectly. So it looks like this code can handle a whopping 20% error rate and still enable successful decoding. Impressive!

Figure 1: Early prototype of a highly effective code.

There's a sting in the tail, though. In order to develop such fast and powerful "error-correction software", our brains have had to spend most of our childhood years hardwiring the many intricate rules and regulations of the code. In the majority of instances (say reading a CD or scanning a barcode) the transmission, reception and possible correction of data has to be achieved automatically by something a little less clever (but much faster) than we are. And if we're going to use our mathematical toolkit to design these procedures, it makes sense to work with numbers rather than letters.

What is an Error-Detecting Code?

This is best illustrated with a familiar example. Take any book published between 1970 and 2006 (the protocol changed slightly in 2007 owing to the increasing number of books in existence, but the principle remains the same) and on the back will be a 10-digit ISBN identification number of the form

    a1 a2 a3 a4 a5 a6 a7 a8 a9 a10,

which is scanned at the bookshop till. Incorrect reading of this number results in the customer being charged the price of a different book and out-of-kilter stock control.
This is avoided by using the first 9 digits to identify the book and choosing the tenth digit (the check digit) so that

    10a1 + 9a2 + 8a3 + 7a4 + 6a5 + 5a6 + 4a7 + 3a8 + 2a9 + a10 ≡ 0 (mod 11).    (1)

This is just another way of saying that the left-hand side must be an exact multiple of 11, since "mod" equates to "remainder on division by". Note that in some cases the check digit must take the value 10, and X is used for this purpose.

So, the first task is to check that the ISBN of the nearest book to hand satisfies condition (1). Next, see what happens if you alter any one of the digits in isolation. This should cause the condition to fail whatever the error. Now try swapping two adjacent digits. Again this should cause the ISBN to be interpreted as invalid. These two errors are far and away the most common in this particular instance. Incidentally, at this stage it is worthwhile to ask why 11 is chosen as the modulus on the right-hand side. What makes this a better choice than, say, 10 or 12?

Figure 2: Error detection in practice.

There is nevertheless a limit to the error detection with this scheme. It is possible to change two separated digits so that (1) is still satisfied, causing the spurious ISBN to slip through the net undetected (though for a 1% chance of error for any particular digit, the likelihood of this is less than 4 in 10000). Since two or more errors can pass the check, but a single error is always highlighted, the code is said to be 1-error detecting. Note, however, that there is no correcting capability here—the code flags any single error, but cannot automatically evaluate which digit is incorrect. This is rarely a problem, since ultimately a bookseller can contact the publisher to address the issue, and indeed it may be argued that this is actually better in this case than a system which could "guess" the correct sequence, but possibly guess wrong (shipping crates of books costs more than a phone call!). In some cases (laser scanning of CDs, pictures beamed from a space probe) we do not have this option. In circumstances such as these, not only is it desirable to detect errors, but we must also fashion a way to work out what the correct data should have been. We'll come to this shortly.
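If you'd rather not hunt down a pile of books, here is a minimal Python sketch of the check described above. The function name and the sample ISBN are my own illustrative choices, not part of any standard library.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Test condition (1): the weighted sum must be an exact multiple of 11.

    Minimal sketch: expects a bare 10-character string, where a final
    'X' stands for a check digit of value 10.
    """
    if len(isbn) != 10:
        return False
    digits = [10 if ch == "X" else int(ch) for ch in isbn]
    # Weights run 10, 9, ..., 2, 1 from left to right, as in (1).
    total = sum(w * d for w, d in zip(range(10, 0, -1), digits))
    return total % 11 == 0

# 0306406152 is a commonly quoted valid ISBN-10; altering any one digit
# or swapping two adjacent digits is always caught.
assert isbn10_is_valid("0306406152")
assert not isbn10_is_valid("0306406952")   # one digit altered
assert not isbn10_is_valid("0306046152")   # two adjacent digits swapped
```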
Some General Considerations

What should a corrupted codeword really be? In an earlier example, we decided the erroneously-received QRESTJON should in fact have been QUESTION. Why not BEETROOT, TOBOGGAN or AARDVARK? Given a small (≪ 0.5) probability of any particular letter being corrupted, the intended codeword is far more likely to be our initial guess, with just 2 corrections, than the other suggestions, which require 6, 7 and 8 respectively. We can never be absolutely sure, but this is certainly a sensible strategy, and it illustrates the principle of maximum likelihood decoding—if a received data string is not an admissible codeword, then the convention is to assume the intended word is that which is "nearest" to the received data. Next we have to define precisely what this means.

From here on in, let's assume our codewords take the form of a string of ones and zeros. This is ultimately necessary for any form of digital transmission, and besides, there exist well-established conventions such as ASCII for turning alphanumeric characters into 8-bit binary strings. Imagine a couple of codewords

    c1 = 10110111,    c2 = 11100101.

We define the Hamming distance d(c1, c2) to be the number of places in which two codewords have differing digits. Hence in this case d(c1, c2) = 3, as c1 and c2 differ in their second, fourth and seventh elements.

Continuing with this form of 8-bit codewords, it would be perfectly feasible to allow every possible binary string, which gives us 2^8 = 256 different codeword possibilities. The problem is, if there are any errors whatsoever, these must remain undetected, since a wrongly received string will nevertheless be a bona fide codeword in its own right. So the trick is to somehow restrict the allowable strings to a smaller subset, and use the convention of maximum likelihood decoding to interpret an erroneously-received string as the codeword which is the smallest Hamming distance away. We are now in a position to define the two central properties of a particular code C:

• We say a code is m error detecting if changing up to m digits in one codeword never produces another; that is, d(ci, cj) > m for all distinct codewords ci, cj.

• We say a code is n error correcting if knowing that a received string differs in at most n places from a codeword of C means that we can deduce the codeword. So provided no received word has more than n errors, we are guaranteed to reconstruct the entire message perfectly.

Figure 3: A 5-bit codeword and its cloud of strings with Hamming distance 1.

We've already seen that ISBN is 1 error detecting and 0 error correcting (albeit in decimal not binary). At the other extreme is the repetition code. In order to guard against channels with a high error rate, this works by taking each bit of the original message data and transmitting it repeatedly a designated number of times to greatly reduce the chance of error. So for an 8-bit repetition code, say, the only admissible codewords are 00000000 and 11111111. This code is 7 error detecting (only if all 8 bits are corrupted will it slip through the net) and 3 error correcting (eg 00010010 will be decoded as 00000000; only with 4 or more errors is there a possibility of a wrong decode). Depending on how noisy — that is, prone to errors — our communication channel is, we may wish to use longer strings still to make the likelihood of wrong decoding even smaller. Using the binomial distribution, you can investigate the chances of more than half the digits being in error for a sequence of given length and bit error rate.

The flip side of this increased confidence, of course, is a big reduction in efficiency. To get our message across in the above case requires sending 8 times the actual information contained. In effect each 1 or 0 in the original message is followed by seven (hopefully identical) check digits. This code is said to have information rate (IR) equal to 1/8, as this is the proportion of "useful message" in the transmission, with the rest being used to improve accuracy. As for those schemes alluded to earlier, the trivial code in which all 8-bit strings are allowed has information rate 1 but no error detection or correction capability. ISBN has IR = 9/10. The binary equivalent of this is the parity check code, in which we break the message data into strings of fixed length n − 1, say, and append a parity check digit an so that the sum of the bits is even:

    a1 + a2 + · · · + an−1 + an ≡ 0 (mod 2).

Here, IR = (n − 1)/n. So there is a tradeoff between IR and confidence in accurate decoding. Most practical codes fall somewhere between these two extremes, and the trick lies in selecting that which best suits the particular task. The short sketch below illustrates these two extremes in code; after that, let's look at some named examples.
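As promised, here is a minimal Python sketch of the two extremes just discussed: the 8-bit repetition code with nearest-codeword decoding, and the parity check digit. All function names are my own illustrative choices.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

# Repetition code of length 8: the only codewords are all-0s and all-1s.
def repetition_decode(received: str) -> str:
    """Maximum likelihood decoding: pick the nearest codeword."""
    codewords = ["0" * 8, "1" * 8]
    return min(codewords, key=lambda c: hamming_distance(received, c))

# Parity check code: append a digit so that the number of 1s is even.
def parity_encode(message: str) -> str:
    return message + str(message.count("1") % 2)

def parity_check(word: str) -> bool:
    return word.count("1") % 2 == 0

assert hamming_distance("10110111", "11100101") == 3   # the example above
assert repetition_decode("00010010") == "00000000"     # 2 errors corrected
assert parity_encode("1011") == "10111"                # IR = 4/5 here
assert not parity_check("10101")                       # single error detected
```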
Hamming’s Code This was the first error correcting code used to input progams via paper tape (ask your Dad!) in the pioneering days of electronic computers from the 1950s onwards. Here, our allowable codewords are 7 digits long and take the form c = c1 c2 c3 c4 c5 c6 c7 in whch the digits obey the 3 restrictions c1 + c3 + c5 + c7 = 0 c2 + c3 + c6 + c7 = 0 c4 + c5 + c6 + c7 = 0 (2) working mod 2. A glance at these equations shows that c3 , c5 , c6 and c7 may be chosen arbitrarily to be 0 or 1 but then the other (check) digits c1 , c2 , c4 are then fixed. This means that only 16 of the 27 = 128 possible strings are codewords and the information rate is 4/7. The relations (2) can be expressed in matrix form as c1 c 2 0 1 0 1 0 1 0 1 c3 (3) 0 1 1 0 0 1 1 c4 = 0 . 0 0 0 0 1 1 1 1 c5 c 6 c7 Note that this 3 × 7 matrix is composed of the binary representations of 1, 2, 3, 4, 5, 6 and 7 if read upward. Error detection is easy. Suppose we receive a data string x = x1 x2 x3 x4 x5 x6 x7 . 6 Figure 4: Richard Hamming, coding pioneer. The jacket was an error that should have been corrected. Referring to the matrix above as H, to check if x is an admissible codeword merely requires us to see if 0 (4) Hx = 0 . 0 The error correction is the cunning bit. If we have a data string x = 1001010 then 1 Hx = 1 , 0 indicating that x is invalid. But reading the right-hand side upwards gives 3 in binary. The correction algorithm then instructs us to alter the third digit of x: 1001010 −→ 1011010. Doing so gives a genuine codeword since condition (4) is now satisfied. This is no fluke — try it for yourself by selecting a random 7-bit string and seeing if (4) is obeyed. If not, the RHS vector tells us (reading upwards in binary) which element of the string to correct. See if you can find all 16 codewords by different choices of c3 , c5 , c6 and c7 . You ought then to be able to verify that the minimum separation (Hamming 7 distance) of any two is 3. Since no two codewords are closer than this, then Hamming’s code is 2 error detecting and 1 error correcting. Generally the degree of error correction of a code is half that of error detection, rounded down. Have a think why this is so. It helps to picture spherical clouds around each codeword containing strings a certain distance away, and to consider how big these clouds can be without overlap. There’s another attractive feature of Hamming’s code. Each codeword c has seven strings just 1-distant, obtained by changing each digit in turn (as in figure 3). These strings are inadmissible and would resolve to c under the correction algorithm. There’s no overlap of each codeword cloud, which contains 8 strings (c and its seven near neighbours), since the minimum distance between codewords is 3. So all clouds together collectively contain 8 × 16 = 128 members, which covers every possible 7-digit string. Such a code is called perfect. Every single received data string is either a legitimate codeword or a single alteration away from one. Hamming’s code can be generalised. For instance we can design a 15-bit code with a 4×15 parity check matrix H containing each of 1, 2,. . . ,15 written upwards in binary. Going the other way, a 3-bit code with a 2 × 3 check matrix generates a kind of simple code we’ve already discussed (which one?) In general there exist Hamming codes of length 2N − 1 (N ≥ 2), each 1 error correcting and having a minimum distance of 3. Have a think about the IR and number of codewords in each case. 
Hamming codes are tailor-made for situations in which very long sequences of binary digits are to be communicated but the chance of any individual bit error is very small, so that it is extremely unlikely any given string possesses 2 errors and is thus decoded wrongly. For the next example, in contrast, we need a scheme with more robust correction properties.

Space: The Final Frontier

Consider a space probe which is launched deep into the solar system to relay information about distant planets and other celestial bodies. The difficulty in sending data back to earth over several billion miles has been colourfully compared to signalling across the Atlantic with a child's torch! NASA have to expect a high error rate on receipt of such a weak signal, but the code must also be moderately economical (unlike a repetition code) so as to ensure we get as much mileage as possible out of the nuclear power source on board.

The technology devised for this is a Reed–Muller code, and here's essentially what goes on. First pick a positive integer N. Then we define N vectors of length 2^N with 1s and 0s in blocks of 1, 2, 4, . . . , 2^(N−1). So for instance with N = 4 these vectors are

    h1 = 1111111100000000
    h2 = 1111000011110000
    h3 = 1100110011001100
    h4 = 1010101010101010

along with an additional vector containing only 1s:

    h0 = 1111111111111111.

Additionally we need to define an overlap operator ∧, rather like an intersection in set theory, which returns 1 only if both vectors have 1 in a particular place, so for example

    h1 ∧ h2 = 1111000000000000
    h1 ∧ h3 = 1100110000000000
    h2 ∧ h4 = 1010000010100000

This operation can be repeated. For instance

    h1 ∧ h2 ∧ h4 = 1010000000000000.

Finally we introduce the idea of a generating set. This is a subset of codewords which, when combined by addition (remembering that in binary 1 + 1 = 0), give us every codeword. To illustrate this, recall that the 4-bit parity check code (codewords with an even number of 1s) consists of the elements

    0000  0011  0101  0111
    1001  1010  1100  1111

but you should satisfy yourself that any one of these can be arrived at by some combination of

    0011  0101  1001

(the null codeword 0000 can be constructed by adding any element to itself). These 3 elements are a generating set for the code.

Figure 5: Voyager 1 — 10 billion miles and still transmitting.

Now we have all the machinery we need. The Reed–Muller RM(4,1) code is the set of all codewords generated by h1, h2, h3, h4 and h0. These 5 elements generate a codebook of 32 (= 2^5) elements and the IR is 5/16 — pretty healthy. But where Reed–Muller comes into its own is with error-prone data. The minimum distance between two codewords turns out to be 2^(4−1) = 8 (try it for two elements, eg d(h1, h3) = 8). So RM(4,1) is 7 error detecting and 3 error correcting.

The second order code RM(4,2) is generated by the generating set of RM(4,1) (h0, h1, h2, h3, h4) together with the six elements hi ∧ hj (i, j = 1, 2, 3, 4, i ≠ j). Here an IR of 11/16 results, with 3 errors detected and 1 corrected. Going further, the RM(4,3) code also includes terms of the form hi ∧ hj ∧ hk in the generating set. Finally with RM(4,4), a last generating element

    h1 ∧ h2 ∧ h3 ∧ h4 = 1000000000000000

is included, and every possible 16-bit string can be manufactured by combining the 16 independent elements of the generating set. This is the familiar trivial code with IR = 1 but no correction properties. You could try to work out similar parameters for other codes of the form RM(d, r), where d ≥ r; the sketch below may be a useful starting point.
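Here is a minimal Python sketch, my own illustration, that builds the RM(4,1) codebook from its generating set and confirms the minimum distance of 8. Adapting the generators lets you explore other RM(d, r) codes.

```python
from itertools import combinations

N = 16  # codeword length, 2^4

# The generating set for RM(4,1): h0 (all 1s) plus the block vectors h1-h4.
h0 = [1] * N
h1 = [1 if i < 8 else 0 for i in range(N)]               # blocks of 8
h2 = [1 if (i // 4) % 2 == 0 else 0 for i in range(N)]   # blocks of 4
h3 = [1 if (i // 2) % 2 == 0 else 0 for i in range(N)]   # blocks of 2
h4 = [1 if i % 2 == 0 else 0 for i in range(N)]          # blocks of 1
generators = [h0, h1, h2, h3, h4]

def add(a, b):
    """Binary addition, digit by digit (1 + 1 = 0)."""
    return [(x + y) % 2 for x, y in zip(a, b)]

# Build the codebook: the sum of every subset of the generating set.
codebook = []
for r in range(len(generators) + 1):
    for subset in combinations(generators, r):
        word = [0] * N
        for g in subset:
            word = add(word, g)
        if word not in codebook:
            codebook.append(word)

def distance(a, b):
    return sum(x != y for x, y in zip(a, b))

assert len(codebook) == 32                     # 2^5 codewords, IR = 5/16
assert min(distance(a, b)
           for a, b in combinations(codebook, 2)) == 8   # 3 errors correctable
```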
Some findings you might like to check:

• The minimum Hamming distance of RM(d, r) is always 2^(d−r), from which the correction capabilities follow.

• RM(d, 0), generated by h0 alone, is the repetition code of length 2^d.

• RM(d, d − 1) is a parity check code, which thus detects 1 error but cannot correct.

• The best payoffs between IR and correction power lie between these extremes.

If this machinery seems complicated to you then take heart, because you're not alone! It wasn't discovered until just before the first satellites were sent into space, and it now crops up in third-year undergraduate maths courses. The Voyager probes launched in 1977 use the RM(5,1) code, and to this day continue to deliver information as they exit the solar system some 10 billion miles away.

Figure 6: This image of lightning storms in Jupiter's atmosphere came back courtesy of RM(5,1).

The decoding procedure (automatically finding which legitimate codeword is closest to a corrupted received string) is a little trickier to implement than for the Hamming code. Note, though, that it is encoding that must be simple, straightforward and economical. Once the precious data has reached earth we can throw the kitchen sink at decoding and take as much time as we wish, in principle.

Back to Earth With a Bang

When sending data from Voyager 1 to Houston, the code must be simple and robust but the decoder can be as complex as required. On the other hand, the encoding system which produces compact discs (in a Sony factory, say) can be very expensive, whereas the decoder (the CD player) must be cheaply mass-produced. Data on an audio CD is read at about 10MB per minute, and high sound quality is only achieved if any errors are corrected in real time as the contents are read. In summary then, the decoding algorithm has to be simple and fast.

CDs employ a special type of code called a cyclic code. Quite simply this means that for an n-digit code, if

    a1 a2 a3 . . . an

is a codeword then

    an a1 a2 . . . an−1

is also a codeword, so by iteration we can "wrap around" an arbitrary number of places. We've seen an example of this structure with the parity check code. For the 4-bit version,

    1100 −→ 0110 −→ 0011 −→ 1001

are all elements of the codebook (see the sketch at the end of this section).

Figure 7: Five billion ones and zeros.

Certain types of cyclic codes are very useful for correcting "burst" errors. This refers to a connected section of data being erroneous. Defects of this kind are obviously common on CDs because of scratches. A particular variant, known as a BCH code, is designed so that it can have an arbitrarily large minimum distance between words, and is ideal for dealing with such problems. Using this, modern CD readers are able to correct in real time a burst of 4000 consecutive errors, which corresponds to about 2.5mm of track. It's an interesting experiment to take a CD (preferably no longer one of your favourites) and see how much it is possible to black out with a permanent marker before your hi-fi protests!
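As promised above, here is one last minimal sketch (again my own illustration) confirming that the 4-bit parity check code really is cyclic:

```python
from itertools import product

def cyclic_shift(word: str) -> str:
    """Move the last digit to the front: a1...an -> an a1...a(n-1)."""
    return word[-1] + word[:-1]

# The 4-bit parity check code: all strings with an even number of 1s.
codebook = {"".join(bits) for bits in product("01", repeat=4)
            if "".join(bits).count("1") % 2 == 0}

# Every shift of every codeword must land back in the codebook.
assert all(cyclic_shift(word) in codebook for word in codebook)

# The chain from the text: 1100 -> 0110 -> 0011 -> 1001.
word = "1100"
for expected in ["0110", "0011", "1001"]:
    word = cyclic_shift(word)
    assert word == expected
```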
Having scratched the surface (no pun) with a few applications, it is evident how coding theory pervades pretty much every aspect of life involving information or electronics. So I hope you'll be motivated to go ahead and use the resources at your disposal to delve further into what is a vital, developing area of maths. And when you do — whether it's buying a book on the subject, researching by downloading webpages, or using your mobile phone to discuss it with a friend — it will all be possible because of error-correcting codes.