Coding Theory
and applications to
Distributed Computing
Carl Bosley
10/29/2007
Overview
• Introduction to ECCs (error-correcting codes)
– History
– Examples
– Current Research
• Multicast with Erasure Codes
Noisy Channel
• Mapping C
– Error-correcting code (“code”)
– Encoding: x → C(x)
– Decoding: y → x
– C(x) is a codeword
[Diagram: x → C(x) → noisy channel → y = C(x) + error → x]
A Noisy Channel
• Aoccdrnig to a rscheearch at
Cmabrigde Uinervtisy, it deosn't mttaer
in waht oredr the ltteers in a wrod are,
the olny iprmoetnt tihng is taht the frist
and lsat ltteer be at the rghit pclae. The
rset can be a toatl mses and you can
sitll raed it wouthit porbelm. Tihs is
bcuseae the huamn mnid deos not raed
ervey lteter by istlef, but the wrod as a
wlohe.
Communication
• Internet
– Checksum used in
multiple layers of
TCP/IP stack
• Cell phones
• Satellite broadcast
– TV
• Deep space
telecommunications
– Mars Rover
“Unusual” applications
• Data Storage
– CDs and DVDs
– RAID
– ECC memory
• Paper bar codes
– UPS (MaxiCode)
• ISBN
Codes are all around us
Other applications of codes
• Applications in theory
– Complexity Theory
• Derandomization
– Cryptography
– Network algorithms
• Network Coding
The birth of coding theory
• Claude E. Shannon
– “A Mathematical Theory of
Communication”
– 1948
– Gave birth to Information theory
• Richard W. Hamming
– “Error Detecting and Error Correcting
Codes”
– 1950
The fundamental tradeoff
• Correct as many errors as possible while
using as little redundancy as possible
– Intuitively, contradictory goals
The Binary Symmetric Channel
[Diagram: BSC -- 0 → 0 and 1 → 1 each with probability 1-p; 0 → 1 and 1 → 0 each with probability p]
• Encoder E: {0,1}^k → {0,1}^n
• Decoder D: {0,1}^n → {0,1}^k
• Each bit sent is received correctly with probability 1-p and flipped with probability p. Errors are independent.
• k = Rn, where R < 1 is called the rate of the source.
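A minimal simulation of this channel (my sketch; the function name is mine, not from the slides):

```python
import random

def bsc(bits, p):
    """Send a bit string through a binary symmetric channel:
    each bit is flipped independently with probability p."""
    return [b ^ (random.random() < p) for b in bits]

# Example: send 10,000 zero bits through a BSC with p = 0.1;
# roughly 10% of them should arrive flipped.
received = bsc([0] * 10_000, p=0.1)
print(sum(received) / len(received))  # ~0.1
```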
Notation
• Hamming Distance:
– For x, y ∈ Σ^n, d(x,y) = number of coordinates i such that x_i ≠ y_i.
– wt(x) = d(x,0)
• Entropy:
– H(p) = -Σ_i p_i log p_i; for a binary source, H(p) = -(p log p + (1-p) log(1-p)).
• Capacity of the Binary Symmetric Channel:
– C(p) = 1 - H(p).
• There exist codes that achieve rate R arbitrarily close to C(p).
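The two formulas in Python (my sketch, using log base 2 so capacity is in bits per channel use):

```python
from math import log2

def binary_entropy(p):
    """H(p) = -(p log p + (1-p) log(1-p)), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def bsc_capacity(p):
    """Capacity of the binary symmetric channel: C(p) = 1 - H(p)."""
    return 1 - binary_entropy(p)

print(bsc_capacity(0.1))  # ~0.531: rates below this are achievable
```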
Some Terminology
• C = (n,k,d)_q code:
– n = block length
– k = information length
– d = distance
– k/n = (information) rate
– q = alphabet size
• Often, it is convenient to think of Σ as a
finite field of size q.
– (“allows multiplication and addition”)
Basic Questions in Coding Theory
• Find optimal tradeoffs for n, k, d, q
• Usually, q is fixed, and we seek C.
– Given n,k,q, maximize d
– Given n,d,q, maximize k
– Given rate, minimize n
Some main types of Codes
Code type      Input interpretation   Encoding      Decoding
Reed-Solomon   Polynomial f           {f(x_i)}      Interpolation (Berlekamp-Welch)
Linear         Vector v               {⟨v, x_i⟩}    Matrix operations
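To make the Reed-Solomon row concrete, here is a toy sketch (my illustration: erasure-only decoding by Lagrange interpolation over the prime field GF(257); real deployments typically use GF(2^8), and recovering from errors rather than erasures needs an algorithm such as Berlekamp-Welch):

```python
P = 257  # a prime, so the integers mod P form a field

def rs_encode(message, n):
    """Treat the k message symbols as polynomial coefficients and
    evaluate the polynomial at the points 0, 1, ..., n-1 (n <= P)."""
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(message)) % P
    return [f(x) for x in range(n)]

def rs_decode(points, k):
    """Recover the k coefficients from exactly k surviving (x, y) pairs
    via Lagrange interpolation over GF(P)."""
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(points):
        basis = [1]   # L_i(x) as a coefficient list, lowest degree first
        denom = 1
        for j, (xj, _) in enumerate(points):
            if i == j:
                continue
            # Multiply basis by (x - xj); accumulate L_i(xi) in denom.
            new = [0] * (len(basis) + 1)
            for t, b in enumerate(basis):
                new[t] = (new[t] - xj * b) % P
                new[t + 1] = (new[t + 1] + b) % P
            basis = new
            denom = denom * (xi - xj) % P
        scale = yi * pow(denom, P - 2, P) % P  # Fermat inverse of denom
        for t, b in enumerate(basis):
            coeffs[t] = (coeffs[t] + scale * b) % P
    return coeffs

msg = [42, 7, 99]                        # k = 3 message symbols
code = rs_encode(msg, n=7)               # any 3 of these 7 suffice
survivors = [(x, code[x]) for x in (1, 4, 6)]
assert rs_decode(survivors, k=3) == msg
```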
Trivial Example Codes
• Repetition
– 0 → 000000, 1 → 111111
– n = 6, k = 1, d = n
• Parity
– Append a parity bit to the message
– 000 → 0000, 001 → 0011, …, 111 → 1111
– n = k+1, d = 2
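Both codes in a few lines of Python (my sketch; the function names are mine):

```python
def repetition_encode(bit, n=6):
    return [bit] * n

def repetition_decode(received):
    """Majority vote: corrects up to floor((n-1)/2) bit flips."""
    return 1 if sum(received) * 2 > len(received) else 0

def parity_encode(bits):
    """Append a parity bit; detects (but cannot correct) any single flip."""
    return bits + [sum(bits) % 2]

def parity_check(word):
    return sum(word) % 2 == 0  # True if no error detected

print(repetition_decode([1, 1, 0, 1, 0, 1]))   # 1 (two flips corrected)
print(parity_check(parity_encode([0, 0, 1])))  # True
print(parity_check([0, 0, 1, 0]))              # False: flip detected
```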
Linear Codes
• If Σ is a field, then Σn is a vector space
• C can be a linear subspace
– Called an [n,k,d]_q code
• Short representation
• Efficient encoding
• Efficient error detection
Linear Codes are Nice
• Generator Matrix:
– k × n matrix G s.t. C = {xG | x ∈ Σ^k}
• Parity Check Matrix:
– n × (n-k) matrix H s.t. C = {y ∈ Σ^n | yH = 0}
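A small worked sketch of these definitions (my example: a hypothetical systematic [6,3] binary code with G = [I | A], using the slides' n × (n-k) convention for H):

```python
import numpy as np

# A (hypothetical) generator matrix for a [6, 3] binary linear code:
# G = [I | A], so the code is systematic (message appears verbatim).
G = np.array([[1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 1]])

# Matching parity check matrix in the slides' convention (n x (n-k),
# codewords y satisfy yH = 0): stack A on top of the identity.
H = np.vstack([G[:, 3:], np.eye(3, dtype=int)])

x = np.array([1, 0, 1])
codeword = x @ G % 2
print(codeword)          # [1 0 1 1 0 1]
print(codeword @ H % 2)  # [0 0 0]: valid codeword
```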
Examples
• Hamming Code
– [n = (q^t - 1)/(q - 1), k = n - t, d = 3]_q
– Rows of H: all nonzero vectors of length t (up to scalar multiples when q > 2)
• Hadamard Code
– Dual of the Hamming code
– Codeword for message m: (⟨m, x⟩)_{x ∈ Σ^t}
– [n = q^t, k = t, d = q^t - q^(t-1)]_q
Hamming (7,4) code
[Figures: encoding example, two decoding examples, and decoding with the parity check matrix -- see the sketch below]
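Since those worked examples are pictorial, here is a minimal Python reconstruction of the same steps (my sketch: one standard systematic (7,4) generator/parity-check pair, which may differ from the matrices in the figures):

```python
import numpy as np

# Generator and parity check matrices for the (7,4) Hamming code,
# in systematic form G = [I | A], H = [A^T | I].
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

def encode(msg4):
    return msg4 @ G % 2

def decode(word7):
    """Syndrome decoding: the syndrome equals the column of H at the
    error position, so a single flipped bit can be located and fixed."""
    syndrome = H @ word7 % 2
    if syndrome.any():
        for pos in range(7):
            if np.array_equal(H[:, pos], syndrome):
                word7 = word7.copy()
                word7[pos] ^= 1
                break
    return word7[:4]  # systematic code: message is the first 4 bits

msg = np.array([1, 0, 1, 1])
cw = encode(msg)
cw[5] ^= 1  # flip one bit in transit
assert np.array_equal(decode(cw), msg)
```

Because the seven columns of H are exactly the seven nonzero binary vectors of length 3, every single-bit error produces a distinct nonzero syndrome, which is why d = 3.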
Codes for Multicasting: Introduction
• Everyone thinks of data as an ordered stream: “I need packets 1-1,000.”
• Using codes, data is like water:
– You don’t care which drops you get.
– You don’t care if some spills.
– You just want enough to get through the pipe.
– “I just need any 1,000 packets.”
Erasure Codes
[Pipeline figure: message of n packets → encoding algorithm → encoding of cn packets → transmission → any ≥ n packets received → decoding algorithm → message of n packets]
Application:
Trailer Distribution Problem
• Millions of users want to download a new
movie trailer.
• 32 megabyte file, at 56 Kbits/second.
• Download takes around 75 minutes at
full speed.
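– Check: 32 MB ≈ 256,000,000 bits; at 56,000 bits/s that is ≈ 4,570 s ≈ 76 minutes.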
Point-to-Point Solution Features
• Good
– Users can initiate the download at their discretion.
– Users can continue download seamlessly after
temporary interruption.
– Moderate packet loss is not a problem.
• Bad
– High server load.
– High network load.
– Doesn’t scale well (without more resources).
Broadcast Solution Features
• Bad
– Users cannot initiate the download at their discretion.
– Users cannot continue download seamlessly after
temporary interruption.
– Packet loss is a problem.
• Good
– Low server load.
– Low network load.
– Does scale well.
A Coding Solution: Assumptions
• We can take a file of n packets, and
encode it into cn encoded packets.
• From any set of n encoded packets, the
original message can be decoded.
Coding Solution
[Timeline figure, 0-5 hours: the file is encoded and the encoding is transmitted over and over (copy 1, copy 2, …); User 1 and User 2 start receiving at different times and each decodes after collecting enough packets]
Coding Solution Features
• Users can initiate the download at their discretion.
• Users can continue download seamlessly after
temporary interruption.
• Moderate packet loss is not a problem.
• Low server load - simple protocol.
• Does scale well.
• Low network load.
So, Why Aren’t We Using This...
• Encoding and decoding are slow for large
files -- especially decoding.
• So we need fast codes to make such a scheme practical.
• We may have to give something up for fast codes.
– Such codes were only recently developed.
Performance Measures
• Time Overhead
– The time to encode and decode expressed as
a multiple of the encoding length.
• Reception efficiency
– Ratio of packets in message to packets
needed to decode. Optimal is 1.
Reception Efficiency
• Optimal
– Can decode from any n words of encoding.
– Reception efficiency is 1.
• Relaxation
– Decode from any (1+ε) n words of encoding
– Reception efficiency is 1/(1+ε).
Parameters of the Code
• Message length: n
• Encoding length: cn
• Packets needed to decode: (1+ε)n
• Reception efficiency: 1/(1+ε)
Previous Codes
• Reception efficiency 1:
– Standard Reed-Solomon: time overhead is the number of redundant packets; uses finite field operations.
– Fast Fourier-based: time overhead is ln² n field operations.
• Reception efficiency 1/(1+ε):
– Random mixed-length linear equations: time overhead is ln(1/ε)/ε.
Tornado Code Performance
• Reception efficiency is 1/(1+ε).
• Time overhead is ln(1/ε).
• Fast and efficient enough to be practical.
Codes: Other Applications?
• Using codes, data is like water.
– What more can you do with this idea?
• Example: Parallel downloads:
Get data from multiple sources, without
the need for co-ordination.
Recent Improvements
• Practical problem with Tornado codes: the encoding length
– Must be decided a priori -- what is the right value?
– Encoding/decoding time and memory are proportional to the encoding length.
• Luby transform (LT codes):
– Encoded packets produced “on the fly” -- no fixed encoding length.
– Encoding/decoding time and memory are proportional to the message length.
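To make the “on the fly” idea concrete, here is a toy fountain-code sketch (my illustration; real LT codes draw each packet's degree from the robust soliton distribution, which I replace with a crude uniform stand-in):

```python
import random

def lt_encode_packet(message_blocks, rng):
    """One encoded packet: the XOR of a random subset of message blocks.
    The uniform degree choice over {1,2,3,4} is a toy stand-in for the
    robust soliton distribution used by real LT codes."""
    degree = rng.choice([1, 2, 3, 4])
    neighbors = rng.sample(range(len(message_blocks)), degree)
    value = 0
    for i in neighbors:
        value ^= message_blocks[i]
    return neighbors, value

def lt_decode(packets, k):
    """Peeling decoder: each degree-1 packet reveals a block, which is
    then XORed out of every packet that covers it."""
    blocks = [None] * k
    work = [[set(nbrs), val] for nbrs, val in packets]
    progress = True
    while progress:
        progress = False
        for pkt in work:
            if len(pkt[0]) == 1:
                i = next(iter(pkt[0]))
                if blocks[i] is None:
                    blocks[i] = pkt[1]
                    progress = True
                # Peel block i out of every packet covering it.
                for other in work:
                    if i in other[0]:
                        other[0].discard(i)
                        other[1] ^= blocks[i]
    return blocks

rng = random.Random(7)
message = [0x12, 0x34, 0x56, 0x78]  # k = 4 small integer "blocks"
stream = [lt_encode_packet(message, rng) for _ in range(12)]
# With a little more than k packets, peeling usually succeeds;
# a None in the output means more packets were needed.
print(lt_decode(stream, k=4))
```

Note the key property: the sender can keep generating fresh packets forever, so no encoding length is fixed in advance.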
Coding Solution
[Timeline figure, 0-5 hours: the file is encoded on the fly into one continuous transmission; User 1 and User 2 begin receiving whenever they like]
Additional Resources
• Tornado codes
– Slides: http://www.icsi.berkeley.edu/~luby/PAPERS/tordig.ps
– Paper: http://www.icsi.berkeley.edu/~luby/PAPERS/losscode.ps
• Network Coding
– Combination of coding theory and graph theory
• Goal of coding theory: achieve capacity on a channel
• Goal of network coding: achieve capacity on a network
– See [NC], [EXOR], [COPE] papers available at
• http://www.news.cs.nyu.edu/~jinyang/fa07/schedule.html
– More links at
• http://www.ifp.uiuc.edu/~koetter/NWC/index.html