Data Persistence in Large-scale Sensor Networks with Decentralized Fountain Codes
Yunfeng Lin, Ben Liang, Baochun Li
INFOCOM 2007
Outline
- Introduction
- Preliminaries
- Persistent Data Access
  - Two-way random walks
  - EDFC and ADFC
- Discussion of Multiple Encoded Blocks
- Performance Evaluation
- Conclusion
Introduction (1/5)
- It has been a conventional assumption that measured data in individual sensors are gathered and processed at powered sinks.
  - Internet connections via data aggregation.
- This assumption may not realistically hold in:
  - large-scale sensor networks
  - inaccessible geographical regions
Introduction (2/5)
- Our proposed vision:
  - Sensors collaboratively store measured data over a historical period of time.
  - At a later time of convenience, a collector retrieves the measured data directly from the sensors.
- PUSH model: sensors send data periodically.
- PULL model: sensors are passively polled by the collector.
Introduction (3/5)
- We propose a novel decentralized implementation of fountain codes in sensor networks.
  - Data can be encoded in a distributed fashion.
  - A sensor disseminates its data to a random subset of sensors in the network.
  - Each sensor only encodes data it has received.
  - The collector is able to decode the original data by collecting a sufficient number of encoded data blocks.
Introduction (4/5)
- Our decentralized implementation of fountain codes does not require the support of a generic routing layer.
  - No routing tables or geographic routing protocols are needed.
  - Random walks are used to disseminate data.
Introduction (5/5)
[Figure: sensing nodes disseminate their source blocks to caching nodes, which store encoded blocks; even after a node failure, the collector gathers enough encoded blocks to decode the sensed data. Legend: sensing nodes, caching nodes.]
Preliminaries
Why Fountain Codes?
- Replication: backup sensors.
  - But a large number of replicas is required.
- Error-correcting codes: implemented in a centralized fashion.
- Random linear codes: decentralized.
  - But the decoding process is computationally expensive: O(K^3).
- Fountain codes:
  - Low decoding complexity: O(K ln K)
  - Superior decoding performance
  ("Digital Fountain Codes V.S. Reed-Solomon Code For Streaming Applications", S. K. Chang)
Preliminaries
LT Codes
- In LT codes, K source blocks can be decoded from any subset of K + O(sqrt(K) ln^2(K/δ)) encoded blocks, with probability 1 - δ.
- Degree: the number of source blocks used to generate an encoded block.
- The degree distribution of encoded blocks in LT codes follows the Robust Soliton distribution.
Preliminaries
LT Codes
- Ideal Soliton distribution ρ(·):

    ρ(i) = 1/K              if i = 1
    ρ(i) = 1/(i(i-1))       for i = 2, 3, ..., K

- Let R = c sqrt(K) ln(K/δ), and define

    τ(i) = R/(iK)           for i = 1, ..., K/R - 1
    τ(i) = R ln(R/δ)/K      for i = K/R
    τ(i) = 0                for i = K/R + 1, ..., K

- Robust Soliton distribution:

    μ(i) = (ρ(i) + τ(i)) / Σ_i (ρ(i) + τ(i))
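The two distributions above can be sketched directly in Python. This is an illustration of the slide's formulas, not the authors' code; the helper name `sample_degree` is ours.

```python
import math
import random

def robust_soliton(K, c=0.2, delta=0.05):
    """Robust Soliton distribution mu over degrees 1..K (index 0 -> degree 1).

    rho and tau follow the slide's definitions; mu is their normalized sum.
    """
    R = c * math.sqrt(K) * math.log(K / delta)
    rho = [1.0 / K] + [1.0 / (i * (i - 1)) for i in range(2, K + 1)]
    tau = [0.0] * K
    spike = int(round(K / R))
    for i in range(1, spike):                      # degrees 1 .. K/R - 1
        tau[i - 1] = R / (i * K)
    tau[spike - 1] = R * math.log(R / delta) / K   # the spike at degree K/R
    beta = sum(r + t for r, t in zip(rho, tau))
    return [(r + t) / beta for r, t in zip(rho, tau)]

def sample_degree(mu):
    """Draw one degree from mu by inverse-transform sampling."""
    u, acc = random.random(), 0.0
    for d, p in enumerate(mu, start=1):
        acc += p
        if u <= acc:
            return d
    return len(mu)
```

With K = 10000, c = 0.2, δ = 0.05 this reproduces the spike at degree K/R ≈ 41 shown on the next slide.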
Preliminaries
LT Codes
[Figure: example of the Robust Soliton distribution for K = 10000, c = 0.2, and δ = 0.05, with a spike at degree K/R = 41.]
- Encoded blocks with a degree higher than K/R are not essential in decoding!
Preliminaries
Random Walks on Graphs
- We describe random walks in the context of disseminating a source block.
  - Sensor: node in the graph.
  - The next hop is randomly chosen from the neighbors of the current node.
  - A random walk corresponds to a time-reversible Markov chain.
- In this paper, we choose a variant of the Metropolis algorithm.
  - A generalization of the natural random walk on a Markov chain.
  - Supports a non-uniform steady-state distribution.
Preliminaries
Metropolis Algorithm
- The Metropolis algorithm computes the transition matrix P = [P_ij] for a given steady-state distribution π = (π_1, π_2, ...).
  - N(i): neighbors of node i
  - M: maximal node degree in the graph

    P_ij = min(1, π_j/π_i)/M      if i ≠ j and j ∈ N(i)
    P_ij = 0                      if i ≠ j and j ∉ N(i)
    P_ii = 1 - Σ_{j≠i} P_ij       if i = j
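The construction above can be sketched as follows. This is a minimal illustration of the slide's formula, not the paper's implementation; `adj` and `pi` are assumed input shapes.

```python
def metropolis_matrix(adj, pi):
    """Transition probabilities P[i][j] for a random walk whose steady
    state is pi, following the Metropolis construction on the slide.

    adj: dict node -> set of neighbors; pi: dict node -> target probability.
    """
    M = max(len(ns) for ns in adj.values())          # maximal node degree
    P = {i: {} for i in adj}
    for i in adj:
        for j in adj[i]:
            P[i][j] = min(1.0, pi[j] / pi[i]) / M    # move i -> j
        P[i][i] = 1.0 - sum(P[i].values())           # self-loop takes the rest
    return P
```

Detailed balance holds by construction (π_i P_ij = min(π_i, π_j)/M is symmetric in i and j), so π is indeed the stationary distribution.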
Persistent Data Access
Decentralized Fountain Codes
[Figure: based on two-way random walks, the K sensing nodes disseminate their source blocks; the N caching nodes, each with a chosen degree d, request source blocks and store encoded blocks.]
Persistent Data Access
Decentralized Fountain Codes
- We seek to construct decentralized fountain codes with only one traversal of random walks.
  - From the sensing nodes to the caching nodes.
  - Caching nodes: encode and store the source blocks.
  - Collector: decodes the source blocks.
- We propose two heuristic algorithms, EDFC and ADFC, which aim to preserve the Robust Soliton distribution of LT codes.
Persistent Data Access
Exact Decentralized Fountain Codes
- Random walks introduce randomization:
  - The number of distinct source blocks received by a node is uncertain.
  - We must disseminate more than d source blocks to each node.
- Redundancy coefficient x_d:
  - Assume each node of degree d receives x_d · d blocks.
  - As x_d increases, Pr(receive fewer than d blocks) decreases.
Persistent Data Access
Exact Decentralized Fountain Codes
- The number of random walks per sensing node:

    b = (N/K) Σ_{d=1}^{K} x_d d μ(d)

- Probabilistic forwarding tables are derived from the steady-state distribution:

    bK π_d = x_d d   ⇒   π_d = x_d d / (bK) = x_d d / (N Σ_{i=1}^{K} x_i i μ(i))
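The two formulas above can be computed directly. The sketch below uses a toy distribution; the helper name `edfc_walk_parameters` is ours, not the paper's.

```python
def edfc_walk_parameters(mu, x, N):
    """Number of random walks per sensing node b, and steady-state
    probabilities pi_d, following the EDFC formulas on the slide.

    mu: degree distribution (index 0 -> degree 1);
    x: redundancy coefficients x_d (same indexing); N: total nodes.
    """
    K = len(mu)
    total = sum(x[d - 1] * d * mu[d - 1] for d in range(1, K + 1))
    b = N * total / K                                 # b = (N/K) sum_d x_d d mu(d)
    pi = [x[d - 1] * d / (N * total) for d in range(1, K + 1)]
    return b, pi
```

A sanity check: bK·π_d = x_d·d (each degree-d caching node expects x_d·d block arrivals), and Σ_d N μ(d) π_d = 1 (the probabilities over all nodes sum to one).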
Persistent Data Access
Exact Decentralized Fountain Codes
[Figure: the K sensing nodes compute π_d, the forwarding tables, and the number of random walks, then disseminate their source blocks to the N caching nodes; each caching node of degree d stores one encoded block, and the collector decodes the sensed data.]
Persistent Data Access
Exact Decentralized Fountain Codes
- The steps of EDFC are:
  - Step 1. Degree generation, from the Robust Soliton distribution.
  - Step 2. Compute the steady-state distribution π_d.
  - Step 3. Compute the probabilistic forwarding table, by the Metropolis algorithm.
  - Step 4. Compute the number of random walks b.
  - Step 5. Block dissemination, based on the probabilistic forwarding table.
  - Step 6. Encoding, by bitwise XOR of a subset of d source blocks.
- The source node IDs are attached to the encoded block!
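Step 6 and the collector's side can be sketched as below: XOR encoding with the source IDs attached, and a standard belief-propagation (peeling) decoder. This is our illustration under an assumed block layout (an ID set plus payload bytes), not the paper's code.

```python
import random
from functools import reduce

def lt_encode(cached, d):
    """Step 6: XOR a random subset of d cached source blocks, keeping
    the contributing source node IDs alongside the payload."""
    chosen = random.sample(sorted(cached), d)
    payload = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                     (cached[i] for i in chosen))
    return set(chosen), payload

def lt_peel(encoded):
    """Collector: peeling decoder over (id_set, payload) pairs.
    Repeatedly releases blocks with exactly one undecoded source."""
    encoded = [(set(ids), p) for ids, p in encoded]
    decoded, progress = {}, True
    while progress:
        progress = False
        for ids, payload in encoded:
            pending = ids - decoded.keys()
            if len(pending) == 1:                  # effective degree one
                (j,) = pending
                for i in ids - pending:            # strip known sources
                    payload = bytes(x ^ y for x, y in zip(payload, decoded[i]))
                decoded[j] = payload
                progress = True
    return decoded
```

Every block the peeler releases is exact; whether all K sources are recovered depends on the collected degrees, which is what the Robust Soliton distribution is designed to make likely.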
Persistent Data Access
Exact Decentralized Fountain Codes
- Overhead ratio: the dissemination cost of EDFC relative to an ideal algorithm with no redundancy,

    g_1 = Σ_{d=1}^{K} x_d d μ(d) / Σ_{d=1}^{K} d μ(d)

- Violation probability: let X be the chosen degree of a node, Y_i the indicator that source block i is received, and Y = Σ_i Y_i. Then

    p = Pr(Y_i = 1 | X = d) = 1 - (1 - π_d)^b ≈ 1 - e^{-x_d d/K}

    Pr(Y < d | X = d) = Σ_{j=0}^{d-1} C(K, j) p^j (1-p)^{K-j}

- Optimization problem (trade-off between coding performance and communication overhead):

    minimize    Σ_d x_d d μ(d)
    subject to  Pr(Y < d | X = d) ≤ δ_d
                x_d ≥ 1            for d = 1, ..., K/R
Persistent Data Access
Exact Decentralized Fountain Codes
- Solve the optimization problem with MATLAB.
- Parameter setting:
  - δ_d (constraints on violation probabilities) = 0.05
  - N (total number of nodes) = 2000
  - K (number of sensing nodes) = 1000
  - c = 0.01, δ = 0.05
- Further numerical computation gives an overhead ratio of 1.4508.
Persistent Data Access
Approximate Decentralized Fountain Codes
- Design a new distribution υ(·) as a hypothetical chosen degree distribution.
  - Attempt to avoid EDFC's redundant random walks.
- Number of random walks:

    b = (N/K) Σ_{d=1}^{K} d υ(d)

- Steady-state distribution of the random walks:

    π_d = d / (NE),   where E = Σ_{i=1}^{K} i υ(i)
Persistent Data Access
Approximate Decentralized Fountain Codes
- The actual degree distribution υ'(·) of a node: with

    p_d = 1 - (1 - d/(NE))^{NE/K}

    υ'(d') = Pr(Y = d') = Σ_{d=1}^{K} υ(d) C(K, d') p_d^{d'} (1-p_d)^{K-d'}

- Optimization problem (minimize the mean-square error between υ'(·) and μ(·)):

    minimize    Σ_{i=1}^{K/R} (υ'(i) - μ(i))^2
    subject to  Σ_{i=1}^{K} υ(i) = 1
                υ(i) ≥ 0           for i = 1, ..., K
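The binomial mixture above can be evaluated numerically; this sketch (the function name `actual_degree_distribution` is ours) computes υ'(·) from a chosen υ(·).

```python
import math

def actual_degree_distribution(upsilon, N, K):
    """Degree distribution v'(d') actually realized at a node: a binomial
    mixture over the chosen distribution v(d), per the slide's formulas.

    upsilon: chosen distribution (index 0 -> degree 1); returns a list
    indexed by d' = 0..K, including the probability of receiving nothing.
    """
    E = sum(i * upsilon[i - 1] for i in range(1, K + 1))
    out = [0.0] * (K + 1)
    for d in range(1, K + 1):
        if upsilon[d - 1] == 0.0:
            continue
        p = 1.0 - (1.0 - d / (N * E)) ** (N * E / K)   # per-block hit prob.
        for dp in range(K + 1):
            out[dp] += (upsilon[d - 1] * math.comb(K, dp)
                        * p**dp * (1 - p)**(K - dp))
    return out
```

Plugging the optimized υ(·) into this function is how the later slide's "chosen vs. actual" comparison is produced.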
Persistent Data Access
Approximate Decentralized Fountain Codes
- The steps of ADFC are:
  - Step 1. Degree generation, from the chosen degree distribution υ(·).
  - Step 2. Compute the steady-state distribution π_d.
  - Step 3. Compute the probabilistic forwarding table, by the Metropolis algorithm.
  - Step 4. Compute the number of random walks b.
  - Step 5. Block dissemination, based on the probabilistic forwarding table.
  - Step 6. Encoding, by bitwise XOR of all received source blocks.
- The source node IDs are attached to the encoded block!
Persistent Data Access
Approximate Decentralized Fountain Codes
- Overhead ratio of ADFC:
  - b: the number of random walks in ADFC
  - b0: the number of random walks in the ideal algorithm

    g_2 = b/b0 = Σ_{d=1}^{K} d υ(d) / Σ_{d=1}^{K} d μ(d)

- By further numerical computation, the overhead ratio g_2 is only 0.2326.
  - Less transmission cost is required. But…
Persistent Data Access
Approximate Decentralized Fountain Codes
- Parameter setting:
  - N (total number of nodes) = 2000
  - K (number of sensing nodes) = 1000
  - c = 0.01, δ = 0.05 (Robust Soliton distribution)
[Figure: the chosen degree distribution υ(·) versus the actual degree distribution — the two diverge: inaccuracy!]
Discussion of Multiple Encoded Blocks
[Figure: sensing nodes send source blocks to a cache node, which combines them into a single encoded block; keeping only one encoded block may lose some information.]
- Does it improve the coding performance if different encoded blocks are maintained?
Discussion of Multiple Encoded Blocks
- Theorem 2
  - When the code-degree distribution conforms to the Robust Soliton distribution, even if the source blocks on each node are not encoded, the collector must visit Ω(K) nodes in order to collect all source blocks with probability 1 - δ, where δ is a small positive number.
- Y_{i,j} is a random variable that assumes the value 1 if source block j is collected when visiting the i-th node.

    Pr(Y_{i,j} = 1) = Σ_{d=1}^{K} Pr(X_i = d) Pr(Y_{i,j} = 1 | X_i = d)
                    = Σ_{d=1}^{K} μ(d) · d/K  ≤  c_1 ln(K/δ)/K

  since the average degree of an encoded block is O(ln(K/δ)) [3].
Discussion of Multiple Encoded Blocks
- Z_j has value 1 if source block j is collected after visiting M nodes.

    Pr(Z_j = 0) = Π_{i=1}^{M} Pr(Y_{i,j} = 0) = Π_{i=1}^{M} (1 - Pr(Y_{i,j} = 1))
                ≥ (1 - c_1 ln(K/δ)/K)^M

- Let E denote the event that all blocks are collected after visiting M nodes.

    Pr(E) = Π_{j=1}^{K} Pr(Z_j = 1) ≤ (1 - (1 - c_1 ln(K/δ)/K)^M)^K

- All blocks are collected with probability 1 - δ:

    (1 - (1 - c_1 ln(K/δ)/K)^M)^K ≥ 1 - δ
Discussion of Multiple Encoded Blocks
- Apply the logarithm to both sides:

    K ln(1 - (1 - c_1 ln(K/δ)/K)^M) ≥ ln(1 - δ) ≈ -δ

    -(1 - c_1 ln(K/δ)/K)^M ≥ -δ/K

- By a similar approximation, we obtain M ≥ K/c_1, i.e., M = Ω(K).
- The collector needs to visit Ω(K) nodes to collect all K source blocks.
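The bound's setting is essentially a coupon-collector process, which a quick simulation (ours, not the paper's) makes concrete: each visited node yields each source block independently with probability q = c_1 ln(K/δ)/K, and the number of visits needed grows linearly in K.

```python
import math
import random

def visits_to_collect_all(K, c1=1.0, delta=0.05, rng=random):
    """Count visits until all K source blocks are seen, where each visit
    yields each missing block with probability q = c1 ln(K/delta)/K."""
    q = c1 * math.log(K / delta) / K
    missing, visits = set(range(K)), 0
    while missing:
        visits += 1
        missing = {j for j in missing if rng.random() >= q}
    return visits
```

For K = 200 the expected visit count is on the order of ln(K)/q ≈ K ln(K)/(c_1 ln(K/δ)), i.e., a constant fraction of K, matching the Ω(K) conclusion.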
Performance Evaluation
- We implement both the original centralized and the decentralized implementations of fountain codes.
  - To evaluate their effectiveness and performance.
- Centralized implementation of fountain codes:
  - About 1000 lines of C++ code.
  - Optimized implementation of the encoding and decoding algorithms.
- Decentralized implementation of fountain codes:
  - Also simulated in C++.
Performance Evaluation
- Use the two-dimensional geometric random graph as the topological model.
  - N sensors are uniformly distributed on a unit disk.
  - K sensing nodes are uniformly distributed among the N sensors.
  - Radio range: r.
- We set K = 10000, N = 20000, and r = 0.033 in most experiments.
  - The average number of neighbors for each node is 21.
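The topology model can be sketched as follows; this is our illustration of the stated model, not the authors' simulator.

```python
import math
import random

def geometric_random_graph(N, r, rng=random):
    """Sample N points uniformly on the unit disk and connect every pair
    within radio range r (the paper's topological model)."""
    pts = []
    for _ in range(N):
        # uniform on the disk: sqrt of a uniform radius avoids center bias
        rad, ang = math.sqrt(rng.random()), 2 * math.pi * rng.random()
        pts.append((rad * math.cos(ang), rad * math.sin(ang)))
    adj = {i: set() for i in range(N)}
    for i in range(N):
        xi, yi = pts[i]
        for j in range(i + 1, N):
            xj, yj = pts[j]
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= r * r:
                adj[i].add(j)
                adj[j].add(i)
    return adj
```

For an interior node the expected neighbor count is about (N-1)·r²; with N = 20000 and r = 0.033 that is ≈ 21.8, consistent with the slide's figure of 21. The O(N²) pair check is fine for a sketch; a grid or k-d tree would be used at full scale.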
Performance Evaluation
Communication Cost and Decoding Ratio
- Two main performance metrics: communication cost and decoding ratio.
- Communication cost:
  - The length of the random walks.
  - The number of random walks.
- Decoding ratio (a measure of fault tolerance!):
  - The number of nodes a collector needs to visit for decoding.
  - Normalized by the number of sensing nodes.
Performance Evaluation
Communication Cost and Decoding Ratio
[Figure: the impact of the length of random walks (50 to 500) on the decoding ratio (around 1.05). Each data point shows the average and the 95% confidence interval from 10 experiments.]
Performance Evaluation
Communication Cost and Decoding Ratio
[Figure: the ratio of the dissemination costs of EDFC and ADFC to that of the two-way algorithm, ranging between about 0.2 and 0.8.]
Performance Evaluation
Multiple Encoded Blocks Cannot Do Better
- Theorem 2: keeping multiple encoded blocks on each node does not offer any asymptotic performance advantage over keeping a single encoded block.
[Figure: the number of nodes to be visited before collecting all source blocks.]
- The collector needs to visit close to K nodes even if the source blocks are not encoded.
Performance Evaluation
Overestimation of K and N
- Sensor failures are common events in large-scale sensor networks.
- It is not feasible to propagate updated values of K and N to all nodes in the network whenever they change.
  - K and N are updated only periodically.
  - Each node may therefore overestimate K and N.
Performance Evaluation
Overestimation of K and N
- The consequence of overestimating N (the total number of nodes):
[Figure: decoding ratio (around 1.05) versus the estimated N; the actual N = 20000.]
Performance Evaluation
Overestimation of K and N
- The impact of overestimating K (the number of sensing nodes):
[Figure: decoding ratio versus the actual K, with estimated K = 10000. EDFC is more robust!]
Conclusion
- In this paper, we seek to improve fault tolerance and data persistence in sensor networks.
  - A decentralized implementation of fountain codes.
  - Original data are disseminated throughout the network with random walks.
  - Retains the superior decoding performance and low decoding complexity of fountain codes.
- The proposed algorithms provide near-optimal fault tolerance as the number of nodes scales up, with minimal demand on local storage.