MSsys - CS

Gossip Algorithms
and
Implementing
a Cluster/Grid Information service
MsSys Course
Amar Lior and Barak Amnon
Agenda
• A short introduction to
gossip algorithms
• Cluster/Grid Information
services requirements
– How good is old
information
• The distributed bulletin
board model
• Implementation
A Problem
• In an n-node system, assume that every
pair of nodes can communicate directly
• Node i wishes to send a message
(rumor, color) to all other nodes
• Possible deterministic solutions
–BROADCAST (only in a broadcast medium)
–Defining a static tree between the nodes and
sending the message along the edges of this
tree
A Gossip Style solution
• Starting with the round in which a rumor is
generated
• each node that holds the rumor selects
another node independently and uniformly
at random
• and sends the rumor to this node
• The distribution of the rumor is terminated
after a fixed number of rounds, O(ln n)
• At this point all players are informed with
high probability
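A minimal simulation sketch of the uniform push scheme described above. It assumes synchronous rounds and, unlike the slide, runs until every node is informed instead of stopping after a fixed number of rounds, so the measured round count can be compared with ln n. The function name `push_gossip_rounds` is hypothetical.

```python
import math
import random


def push_gossip_rounds(n, source=0, seed=None):
    """Simulate uniform push gossip; return rounds until all n nodes are informed."""
    rng = random.Random(seed)
    informed = {source}
    rounds = 0
    while len(informed) < n:
        targets = set()
        # Every node that held the rumor at the start of the round picks one
        # other node uniformly at random and pushes the rumor to it.
        for node in informed:
            target = rng.randrange(n - 1)
            if target >= node:
                target += 1  # skip self, keeping the choice uniform over the others
            targets.add(target)
        informed |= targets
        rounds += 1
    return rounds


if __name__ == "__main__":
    n = 1000
    print("rounds:", push_gossip_rounds(n, seed=1), "ln n:", round(math.log(n), 2))
```

Since the informed set can at most double per round, the result is never below log2(n); the random choices add further rounds on top of that.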
Uniform Gossip Example
[Figure sequence: five animation frames of the rumor spreading, time steps 1 to 5]
Gossip benefits
• Robustness to the presence of node failures
– Messages will continue to propagate due to the
random selection of destination
– The failure of F nodes results in only O(F) uninformed
players
• Simplicity
– All nodes run the same algorithm
• Scalability
– The number of messages each node sends (and
possibly receives) each round is fixed
Gossip taxonomy
• Other names are
– Epidemic algorithms (Demers et al.)
– Randomized communication (Karp et al.)
• Propagation can be done by
– Push – sending the information from the node to the
selected node
– Pull – the other way around
– Push & Pull – both ways
• We distinguish between 2 conceptual layers
– A basic gossip algorithm
» by which nodes choose other nodes for communication
– A gossip-based protocol
» Built on top of a gossip algorithm
» Determines the content of the messages that are sent
» Determines how received messages cause nodes to update their
internal state
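The three propagation styles can be sketched as one synchronous round function. This is an illustration under assumed simplifications (every node contacts exactly one uniformly random peer per round, and the only state is "informed or not"); `gossip_round` and its `mode` parameter are hypothetical names.

```python
import random


def gossip_round(informed, n, mode, rng):
    """Run one synchronous round and return the new set of informed nodes.

    mode is "push", "pull" or "push-pull".
    """
    new_informed = set(informed)
    for node in range(n):
        peer = rng.randrange(n - 1)
        if peer >= node:
            peer += 1  # uniform choice among the other n - 1 nodes
        if mode in ("push", "push-pull") and node in informed:
            new_informed.add(peer)   # push: the caller sends the rumor
        if mode in ("pull", "push-pull") and peer in informed:
            new_informed.add(node)   # pull: the caller fetches the rumor
    return new_informed


if __name__ == "__main__":
    rng = random.Random(7)
    for mode in ("push", "pull", "push-pull"):
        informed, rounds = {0}, 0
        while len(informed) < 200:
            informed = gossip_round(informed, 200, mode, rng)
            rounds += 1
        print(mode, rounds)
```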
Rumor spreading bounds
From a single node to all
• Time complexity:
O (ln n)
• Message complexity (Karp et al.): lower bound
on the number of messages:
$\Omega(n \ln \ln n)$
Spatial Gossip (Kempe et al.)
• New information is most
interesting to nodes that are
nearby
• Combines the benefits of
– Uniform gossip
– Deterministic flooding
• The gossip algorithm chooses the
nodes according to
$p_{x,y} = c_x (d + 1)^{-D\rho}$
• New information is spread to nodes
at distance d, with high
probability, in:
$O(\log^{1+\varepsilon} d)$
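A sketch of distance-biased peer selection, assuming the selection probability has the inverse-polynomial form $p_{x,y} = c_x (d + 1)^{-D\rho}$, where d is the distance between x and y, D is the dimension, ρ is an exponent and $c_x$ normalizes the probabilities to sum to 1. The one-dimensional node positions, the default values of `D` and `rho`, and the function names are assumptions made for illustration.

```python
import random


def spatial_weights(x, positions, D=1, rho=1.5):
    """Unnormalized weights (d + 1)^(-D * rho) for every node y != x."""
    weights = {}
    for y, pos in positions.items():
        if y == x:
            continue
        d = abs(pos - positions[x])  # assumed metric: distance on a line
        weights[y] = (d + 1) ** (-D * rho)
    return weights


def choose_peer(x, positions, rng, D=1, rho=1.5):
    """Pick a gossip partner for x; rng.choices normalizes the weights (the role of c_x)."""
    weights = spatial_weights(x, positions, D, rho)
    nodes = list(weights)
    return rng.choices(nodes, weights=[weights[y] for y in nodes], k=1)[0]


if __name__ == "__main__":
    positions = {i: i for i in range(20)}
    rng = random.Random(3)
    picks = [choose_peer(0, positions, rng) for _ in range(1000)]
    print("share of picks at distance 1:", picks.count(1) / len(picks))
```

Nearby nodes are chosen most often, while distant nodes keep a small non-zero probability, which is what mixes flooding-like locality with uniform gossip.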
Aggregating values
• Gossip can also be used to aggregate a value
over all nodes
• Average, maximum, minimum …
• In this case the question is how fast the local
value in each node converge to the desired
value
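As an illustration of aggregation, the sketch below computes an average by pairwise averaging: in each exchange two nodes replace their local values with the mean of the two. This is one simple gossip-style averaging scheme chosen for the example, not necessarily the protocol the slide refers to; `gossip_average` is a hypothetical name.

```python
import random


def gossip_average(values, rounds, seed=None):
    """Return the local values after `rounds` rounds of pairwise averaging."""
    rng = random.Random(seed)
    vals = list(values)
    n = len(vals)
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # uniform choice of a partner other than i
            mean = (vals[i] + vals[j]) / 2
            vals[i] = vals[j] = mean  # the sum over all nodes is preserved
    return vals


if __name__ == "__main__":
    loads = [0.0, 10.0, 20.0, 30.0]
    print(gossip_average(loads, rounds=20, seed=1))  # each value approaches 15.0
```

Because each exchange preserves the global sum, the only value all nodes can agree on is the true average; the open question on the slide is how many rounds that takes.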
Cluster/Grid Information services
• Basic properties of Grid environment
– Information sources are distributed
– Individual sources are subject to failure
– Total number of information providers is large
– Both the types of information sources and the
ways they are used can vary
• We cannot in general provide users with
accurate information: any information
delivered to a user is “old”
– How useful is old information? (Mitzenmacher)
– How to build an information service with
guaranteed age properties?
Distributed Bulletin board
• The system
– Consists of ‘N’ nodes (or clusters)
– Distributed
– Nodes are subject to failure
• Each node maintains a data structure that
holds an entry on selected (or all) nodes in
the system
• We refer to this data structure as “The vector”
• Each vector entry holds:
– State of the resources (static and dynamic) of
the corresponding node
– Age of the information (tuned to the local clock)
• The vector is a distributed bulletin board that
serves information requests locally
Algorithm 1 – Information dissemination
• Each time unit
– Update local information
– Find all vector entries whose age is at most t
– Choose a random node
– Send the above entries to that node
• Upon receiving a message
– Compute the age of the received entries
– Update the entries for which the newly received information is fresher
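The steps above can be sketched as a small round-based simulation. It assumes a shared round counter and instant delivery, so the age of a received entry equals the age it was sent with; a real deployment would convert ages to the receiver's local clock. The `Node` class, the `run` helper and the placeholder state strings are hypothetical.

```python
import random


class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.vector = {node_id: ("state-0", 0)}  # node id -> (state, age)

    def tick(self, now):
        # Age every entry by one time unit, then refresh the local entry.
        self.vector = {k: (s, a + 1) for k, (s, a) in self.vector.items()}
        self.vector[self.id] = (f"state-{now}", 0)

    def window(self, t):
        # All entries whose age is at most t.
        return {k: e for k, e in self.vector.items() if e[1] <= t}

    def receive(self, entries):
        # Keep a received entry only if it is fresher than the local one.
        for k, (state, age) in entries.items():
            if k not in self.vector or age < self.vector[k][1]:
                self.vector[k] = (state, age)


def run(n, t, steps, seed=None):
    rng = random.Random(seed)
    nodes = [Node(i) for i in range(n)]
    for now in range(1, steps + 1):
        for node in nodes:
            node.tick(now)
        outgoing = []  # collect first, so all windows reflect the start of the round
        for node in nodes:
            dest = rng.randrange(n - 1)
            if dest >= node.id:
                dest += 1
            outgoing.append((dest, node.window(t)))
        for dest, entries in outgoing:
            nodes[dest].receive(entries)
    return nodes


if __name__ == "__main__":
    for node in run(n=8, t=2, steps=30, seed=1)[:2]:
        print(node.id, {k: age for k, (_, age) in sorted(node.vector.items())})
```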
[Figure: example node vectors with entry ages (A:1 C:2 D:4; A:1 B:12 C:2; B:1 C:3 E:3; A:4 B:12 C:2 D:4 E:11; D:4 E:11)]
Algorithm 1 : t=2
[Figure sequence: five animation frames of Algorithm 1 with t=2, time steps 1 to 5]
Bounds and Approximations
• We want to know “how old” is the information in the
vector
• First we find E(Xt) (for the asynchronous case)
– The expected number of nodes that have information about
node i which is up to t time unit old
$$E[X_t] \approx \frac{n\, e^{(1+\frac{1}{n})t}}{n - 1 + e^{(1+\frac{1}{n})t}}$$
– In the early phase (while $E[X_t] \ll n$): $E[X_t] \approx e^{t}$
• Synchronous case
$$E[X_t] \approx 2^{t}$$
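A short numeric sketch of the asynchronous-case estimate, assuming it has the logistic form $E[X_t] \approx n e^{(1+1/n)t} / (n - 1 + e^{(1+1/n)t})$. The function name is hypothetical, and the expression is rearranged to $n / (1 + (n-1) e^{-(1+1/n)t})$, which is algebraically the same but avoids overflow for large t.

```python
import math


def expected_informed(n, t):
    """Logistic estimate of E[X_t]: nodes holding information about node i of age <= t."""
    rate = 1 + 1 / n
    return n / (1 + (n - 1) * math.exp(-rate * t))


if __name__ == "__main__":
    n = 256
    for t in range(0, 13, 2):
        print(t, round(expected_informed(n, t), 1))
```

The estimate starts at 1 (only node i itself), grows roughly like $e^t$ while it is far below n, and saturates at n.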
Bounds and Approximations
• An approximation for the expected age of the vector
$$A_v \approx \frac{n-1}{n}\left(\frac{n-1}{E[X_t]} + A_w\right)$$
Real results
Approximating the age distribution
• $A_k$ is a random variable describing the number of
nodes which are up to age k
$$E[A_k] \approx \begin{cases} E[X_k] & k \le t \\ n\left(1 - q^{\,k - A_w}\right) & k > t \end{cases}$$
Age distribution
Handling inactive nodes
• The presence of inactive nodes causes problems
– The age quality of the information deteriorates
– The number of ARP broadcasts increases linearly
• Using a fixed-size window improves the age quality, but the
number of ARP broadcasts stays the same
Algorithm 2
• Algorithm 2 solves the above two issues
• It works basically the same as Algorithm 1, with the
following difference when sending a message
– Calculate l, the number of active nodes
(from the local vector)
– Generate a random integer k in the range 0…l
– If k=0, send the window to a node chosen from all nodes
– Else send the window to a node chosen only from the active nodes
• Using Algorithm 2, the maximal expected number of
messages to inactive nodes is ≤ 1
– Counted over all nodes, in each round
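A sketch of the destination choice in Algorithm 2. It assumes one reading of the rule above: each message still goes to a single random destination, drawn from all nodes when k=0 and from the active nodes otherwise. With l active senders each taking the k=0 branch with probability 1/(l+1), the expected number of messages that can reach inactive nodes per round is at most l/(l+1), which matches the stated bound of 1. The function and parameter names are hypothetical.

```python
import random


def choose_destination(self_id, all_nodes, active_nodes, rng):
    """Return the destination for this round's window, or None if there is no candidate."""
    l = len(active_nodes)      # number of active nodes, taken from the local vector
    k = rng.randint(0, l)      # uniform integer in 0..l, both ends included
    pool = all_nodes if k == 0 else active_nodes
    candidates = [node for node in pool if node != self_id]
    return rng.choice(candidates) if candidates else None


if __name__ == "__main__":
    rng = random.Random(5)
    all_nodes = list(range(10))
    active = [0, 1, 2]
    picks = [choose_destination(0, all_nodes, active, rng) for _ in range(1000)]
    to_inactive = sum(1 for p in picks if p not in active)
    print("fraction sent to inactive nodes:", to_inactive / len(picks))
```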
Algorithm 2 – Age performance
Algorithm 2 – minimizing messages to inactive nodes
[Figure sequence: four animation frames of Algorithm 2, time steps 1 to 4]
Supporting Urgent information
• In the previous algorithms, information is propagated from
all nodes constantly
• In some cases we wish to send an important
message urgently to all nodes
– Such as the detection of a newly dead node
– In this case the source node gives the message a high priority of
2*log(n)
• When a node assembles the window it is about to
send, it first takes the entries with the highest priority and
only then the younger entries
• The priority of an entry is decremented every time
unit
• The result is that urgent messages are disseminated
in O(log(n)) steps
• Regular information is disseminated a bit slower
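A sketch of priority-aware window assembly as described above. The dictionary layout of an entry, the fixed window size, the floor of zero on priorities and the use of the natural logarithm in 2*log(n) are assumptions for illustration; the slide does not specify them.

```python
import math


def urgent_priority(n):
    """Initial priority of an urgent entry: 2*log(n), rounded up (natural log assumed)."""
    return math.ceil(2 * math.log(n))


def assemble_window(entries, size):
    """Pick entries for the next message: highest priority first, then youngest."""
    ranked = sorted(entries, key=lambda e: (-e["priority"], e["age"]))
    return ranked[:size]


def tick(entries):
    """Advance one time unit: entries get older and priorities decay toward zero."""
    for e in entries:
        e["age"] += 1
        e["priority"] = max(0, e["priority"] - 1)


if __name__ == "__main__":
    entries = [
        {"node": "A", "age": 1, "priority": 0},
        {"node": "B", "age": 9, "priority": urgent_priority(64)},  # urgent: B is dead
        {"node": "C", "age": 0, "priority": 0},
    ]
    print([e["node"] for e in assemble_window(entries, size=2)])
    tick(entries)
```

Because the priority decays by one per time unit, an urgent entry stays at the front of every window for about 2*log(n) rounds and then competes on age like any other entry.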
Information service clients
• MOSIX
– Load balancing
» Fresh information is used by the load
balancing algorithm to consider
migrating processes
– mmon, the MOSIX monitoring tool
» Presents the vector of a specific node
» mmon -h xil-10
• MPICH
– Improved assignment of processes
to nodes
» No assignment to “dead” nodes
» Assignment to the least loaded ones
• Nagios
– Collecting information about clusters
over time (history)
– Periodically retrieving a vector from
a machine and keeping it
• Decision algorithms at the cluster
level
– Leader election (queue fault
tolerance)
– Node reservation
Conclusions
• Constructed a distributed bulletin board
– Age properties are guaranteed
– The administrator can configure it to the desired
properties
– No two nodes have the same view of the system
– Information requests are served locally
– Noise level (messages to inactive nodes) is constant
– Urgent messages are propagated quickly
Future Work
• Investigating other gossip models
– Push and Pull-Push
• Using only a partial view of the system