Remote Reference Counting:
Distributed Garbage Collection with Low
Communication and Computation Overhead
www.cs.technion.ac.il/~assaf/publications/gc.ps
Distributed Systems
• Consist of nodes:
– Lowest level: local address space
– Next level: disk partition, processor
– Top level: local net
• Interaction through message passing
• Failures:
– Due to hardware or software problems
– Disconnection: due to network overload,
reboot...
Distributed GC
• Motivations:
– Transparent object management
– Storage management is complex - not to be
handled by users
• Goals:
– Efficiency
– Scalability
– Fault tolerance
Distributed GC
• The main problem:
– A section of GC code running on one node
must verify that no other node needs an object
before collecting it
• Result:
– Many modules must cooperate closely, leading
to a tight binding between supposedly
independent modules
Distributed GC
• Problems with simple approaches:
– Determining the status of a remote node is
costly
– Asynchronous systems ⇒ inconsistent data
– Failures
Remote References
• Terminology:
– Owner - node which contains the object
– Client - node which has a reference to the object
• Creation:
– A reference to an object crosses node boundaries
– Side effect of message passing
• Duplication:
– A client of a remote object sends a reference to
that object to another (receiver) node
Naive Reference Counting
• Keep a reference count for each object
• Upon duplication or creation, inform the
owner to update the counter by sending it
a control message
• Problems:
– Increases communication overhead
– Loss or duplication of messages
– Race between decrement/increment messages
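The decrement/increment race can be seen in a few lines of Python. This is a hypothetical sketch, not code from the paper: the `Owner` class, `deliver`, and the message list are invented for illustration. The owner's counter hits zero when the decrement overtakes the increment, and a live object is collected.

```python
# Hypothetical sketch of naive reference counting: control messages
# (+1 on duplication, -1 on deletion) may be reordered in transit.

class Owner:
    """Owner of object V; keeps a plain reference count."""
    def __init__(self):
        self.count = 1          # one remote client (A) holds &V
        self.collected = False

    def deliver(self, delta):
        self.count += delta
        if self.count == 0:     # owner believes no client needs V
            self.collected = True

owner = Owner()

# Client A duplicates &V to client C (sends +1 to the owner),
# then deletes its own reference (sends -1).
in_flight = [+1, -1]

# Asynchronous network: the -1 overtakes the +1.
for msg in reversed(in_flight):
    owner.deliver(msg)

# V is prematurely collected even though C still holds a live reference.
print(owner.collected)  # True: the race collected a live object
```

Acknowledgement messages (next slides) close exactly this window: A may not forward &V until the owner has confirmed the +1.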
Race Conditions in Naive
Reference Counting:
Decrement/Increment
[Diagram: client A holds a reference &V to object V at owner B
(CounterV = 1); A forwards &V to client C with a +1 message to B, then
deletes its own reference and sends -1; the -1 arrives first, so V is
collected while C's reference is still in transit]
Race Conditions in Naive
Reference Counting:
Increment/Decrement
[Diagram: the symmetric interleaving, with the +1 and -1 messages
exchanged in the opposite order; CounterV = 1]
Avoiding Race by Acknowledge
Messages
[Diagram: A waits for B's ack of the +1 before forwarding &V to C, so
CounterV goes 1 → 2 before any decrement can arrive]
Weighted Reference Counting
• Each object referenced has a partial weight
and a total weight
• Object creation:
– total weight = partial weight = even value > 0
[Diagram: owner B creates V with Total = Partial = 64]
Weighted Reference Counting:
Reference Duplication
partial weight halved and sent with the reference
[Diagram: A duplicates &V to C; PartialV at A drops 32 → 16 and the new
reference carries weight 16, while TotalV = 64 at owner B is untouched]
Weighted Reference Counting:
Reference Deletion
partial weight sent to owner and subtracted from total weight
[Diagram: C deletes its reference and sends -16 to owner B; TotalV drops
64 → 48 while A's PartialV = 32 is untouched]
Weighted Reference Counting
• Invariant: total weightV = Σ partial weightV
• When total weight = partial weight there are
no remote references
• Advantage: Eliminates increment messages,
and therefore race conditions
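As a toy illustration of the scheme (my own sketch; the `WeightedRef`/`OwnerV` names and the halving policy are assumptions, not the paper's code), duplication halves the sender's partial weight with no message to the owner, while deletion returns the partial weight for subtraction from the total:

```python
# Hypothetical sketch of weighted reference counting.

class WeightedRef:
    def __init__(self, partial):
        self.partial = partial

    def duplicate(self):
        # Half the weight travels with the new reference; no owner message.
        half = self.partial // 2
        self.partial -= half
        return WeightedRef(half)

class OwnerV:
    def __init__(self, total=64):
        self.total = total
        self.root_ref = WeightedRef(total)  # creation: total = partial

    def delete(self, ref):
        # Deletion: partial weight is sent to the owner and subtracted.
        self.total -= ref.partial

    def has_remote_refs(self):
        # Invariant: total weight = sum of all partial weights.
        return self.total != self.root_ref.partial

owner = OwnerV()
rB = owner.root_ref           # total = partial = 64
rA = rB.duplicate()           # rB: 32, rA: 32
rC = rA.duplicate()           # rA: 16, rC: 16
owner.delete(rC)              # total = 48
owner.delete(rA)              # total = 32
print(owner.has_remote_refs())  # False: total equals rB's partial again
```

Repeated halving is also where underflow comes from: after enough duplications the partial weight reaches 1 and can no longer be split.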
Weighted Reference Counting
• Shortcomings:
– Weight underflow
• Possible solutions:
– Use partial weights which are powers of 2, keep
only the exponent
– [Yu-Cox] “Stop the world”, last resort global trace
• Not resilient to message loss or duplication:
– Loss may cause garbage objects to remain uncollected
– Duplication may cause an object to be prematurely
collected
Indirect Reference Counting
• Stub contains strong and weak locators
– Strong: refers to a scion in the sender node; used
only for distributed GC
– Weak: refers to the node where target object is
located; used to invoke target object in a single hop
• Duplication performed locally without
informing the owner node
– The weak reference is sent along with the message
containing the reference
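A rough sketch of the stub/scion chaining (hypothetical names and structure; a real implementation would also propagate deletions up the chain toward the owner):

```python
# Hypothetical sketch of indirect reference counting: a stub carries a
# strong locator (a scion at the sender, used only by the GC) and a weak
# locator (the owner node, used to reach the object in one hop).

class Scion:
    def __init__(self):
        self.count = 0   # stubs whose strong locator targets this scion

class Stub:
    def __init__(self, strong, weak):
        self.strong = strong   # scion in the sender node (GC chain)
        self.weak = weak       # owner node (single-hop access)
        self.scion = Scion()   # scion for duplicates made from this stub

def duplicate(stub):
    # No message to the owner: the receiver's strong locator chains back
    # to the sender, while the weak locator still reaches the owner.
    stub.scion.count += 1
    return Stub(strong=stub.scion, weak=stub.weak)

owner_scion = Scion()
owner_scion.count = 1                         # owner B's scion for V
a = Stub(strong=owner_scion, weak="owner")    # A is a direct client of B
b = duplicate(a)                              # purely local duplication
print(b.weak)   # "owner": the new stub still reaches V in one hop
```

The GC chain (strong locators) grows with each duplication, but invocations always take one hop via the weak locator.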
Indirect Reference Duplication
[Diagram: A's stub for V holds a strong locator to the scion at owner B
(count 1) and a weak locator to B itself; duplicating to C sends the weak
locator along, and C's new stub chains its strong locator to a fresh scion
at A]
Indirect Reference Deletion
[Diagram: initial state - A's stub for V with its strong and weak
locators; owner B's scion for V has count 1]
Indirect Reference Deletion
[Diagram: A deletes its stub and sends -1 along the strong locator toward
the scion at B]
Indirect Reference Deletion
[Diagram: final state - the decrement has reached B, whose scion for V can
now be discarded]
Indirect Reference Counting
• Advantages:
– Unlimited number of duplications
– Access to object in one hop through weak
locator
• Disadvantages:
– Not resilient to message failures
– Messages are sent whenever an object is deleted
Reference Listing
• The object’s owner allocates a table of
outgoing pointers (scions), one for each
client that owns a reference to the object
• Client nodes hold tables of incoming
pointers (stubs)
[Diagram: object X lives at owner A, which keeps one scion per client
node; clients B and C hold stubs referring to X]
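A minimal sketch of the owner-side table (invented names; a real implementation would pair this with per-client stub tables and live/delete messages):

```python
# Hypothetical sketch of reference listing: the owner keeps one scion per
# client node instead of a counter, so it can query or drop individual
# clients after a failure.

class ListedObject:
    def __init__(self, name):
        self.name = name
        self.scions = set()        # client node ids holding a reference

    def reference_sent_to(self, client):
        self.scions.add(client)    # one scion per client, not a count

    def reference_deleted_by(self, client):
        self.scions.discard(client)

    def client_crashed(self, client):
        # Owner's policy decision: here we drop the crashed client's scion.
        self.scions.discard(client)

    def is_remote_garbage(self):
        return not self.scions

x = ListedObject("X")
x.reference_sent_to("B")
x.reference_sent_to("C")
x.reference_sent_to("B")       # repeated send: set keeps a single scion
x.reference_deleted_by("C")
x.client_crashed("B")
print(x.is_remote_garbage())   # True
```

The set-per-object layout is exactly the memory overhead the slides list as a disadvantage: space grows with the number of clients, not O(1) as with a counter.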
Use of Timestamps
[Diagram: the owner sends &X/1 to client C; C sends back "delete X/1";
meanwhile the owner sends &X/2; the stale "delete X/1" is ignored on
arrival, since the scion's timestamp is now 2]
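The timestamp rule sketched in Python (hypothetical; assumes the owner stamps each sent reference and compares the stamp on incoming deletes):

```python
# Hypothetical sketch: timestamped scions make reference listing resilient
# to reordered or duplicated control messages. A "delete X/t" is ignored
# unless t matches the newest timestamp the owner has sent to that client.

class TimestampedScion:
    def __init__(self):
        self.sent_ts = 0      # timestamp of the latest "&X" sent
        self.live = False

    def send_reference(self):
        self.sent_ts += 1     # "&X/<ts>" travels with the reference
        self.live = True

    def receive_delete(self, ts):
        # Stale delete (the reference was re-sent since): ignore it.
        if ts == self.sent_ts:
            self.live = False

s = TimestampedScion()
s.send_reference()       # Sent &X/1
s.send_reference()       # Sent &X/2 (reference re-sent to the same client)
s.receive_delete(1)      # delete X/1 arrives late: ignored
print(s.live)            # True: the client still holds &X/2
```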
Reference Listing
• Advantages:
– Resilience to message duplication when timestamps are used
– Resilience to node failure: Owner can prompt client to send
a live/delete message
– Owner may explicitly query about a reference that is
suspected to be part of a distributed garbage cycle
– Owner can decide whether to keep objects referred to by a
crashed client node until it recovers or not
• Disadvantages:
– Memory overhead
– Doesn’t collect cycles of garbage
Remote Reference Counting
• Advantages:
– Communication cost depends only on the
number of nodes in the system
• Independent of pointer operations
• Independent of heap size
– Messages are sent only during GC, when the
chance of collecting an object is very high
– Independent of consistency protocols and
global order of operations
Remote Reference Counting
• Disadvantages:
– Doesn’t collect cycles of garbage
– Dependent on the number of nodes in the
system
The System Model
• Communication through a reliable
asynchronous message-passing system
– Messages are never lost, duplicated or altered
– Messages can be delayed or arrive out of order
• Processors can share objects
• Objects can be replicated
Local and Remote Counters
• Local and remote counters are attached to
every shared object
• Locali(X)
– Increased by m when node i receives a message
containing m pointers to X
– Otherwise maintained as in traditional reference
counting
– When Locali(X) = 0, i is clean - has no
references to X
Local and Remote Counters
• Remotei(X)
– Increased by m when some object Y containing m
pointers to X is sent from node i
– Decreased by m when some object Y containing
m pointers to X is received at node i
– The sum of Remotei(X) over all nodes is the
number of pointers to X in transit in the system
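The two counters can be sketched as follows (hypothetical `NodeCounters` class; the send/receive rules mirror the definitions above):

```python
# Hypothetical sketch of RRC's per-node counters for a shared object X:
# Local_i counts references held at node i, Remote_i counts pointers sent
# minus pointers received, so sum(Remote_i) over all nodes equals the
# number of pointers to X currently in transit.

class NodeCounters:
    def __init__(self):
        self.local = 0
        self.remote = 0

    def send_pointers(self, m):
        self.remote += m     # m pointers to X leave this node in a message

    def receive_pointers(self, m):
        self.remote -= m     # m pointers to X arrive at this node...
        self.local += m      # ...and become local references

    def clean(self):
        return self.local == 0

i, j = NodeCounters(), NodeCounters()
i.local = 1                  # i created X and holds one reference
i.send_pointers(2)           # i sends an object containing 2 pointers to X
print(sum(n.remote for n in (i, j)))   # 2: two pointers in transit
j.receive_pointers(2)
print(sum(n.remote for n in (i, j)))   # 0: nothing in transit
```

Note that Remote may go negative at a receiver (here j ends at -2); only the system-wide sum is meaningful.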
The Algorithm - Layout
• Build a spanning tree covering all the nodes
• Collection of object X:
– The root sends signals to all its children
– Inner nodes pass the signal down
– When a leaf is clean it sends up a token
– An inner node sends up a token when it has received
tokens from all its children and is clean
– When the root has received tokens from all its children it
checks a condition C:
• C = true ⇒ X is garbage
• Otherwise - another wave begins
The Algorithm
[Diagram: signals propagate down the spanning tree; each node is labeled
with its local(x) value, here 0 everywhere]
The Algorithm
[Diagram: tokens propagate up from clean leaves; all nodes have
local(x) = 0 except one with local(x) = 1]
The Algorithm
R = R0, where R0 ≡ all the nodes outside S are clean
[Diagram: S marks the nodes that have not yet sent a token; tokens from
clean nodes outside S climb toward the root; one node has local(x) = 1]
Example: R0 falsification
[Diagram: via Y:=Z, a pointer to x is sent to node j, which has already
sent its token; j becomes dirty and R0 is falsified without the root
noticing]
Example: R0 falsification
[Diagram: after the send, Locali(x) = 1 and Remotei(x) = 1 at sender i,
while Localj(x) = 2 and Remotej(x) = -1 at receiver j, now dirty despite
having sent its token]
The Algorithm
• Use the remote counter to count pointers sent
and received
• ∆i definition:
– for a node i outside S, ∆i is the value held at
remotei(X) when i sent its token
– for a node i in S, ∆i is the current value of remotei(X)
• ∆ = Σ ∆i
• ∆fin = the value of ∆ at the end of the wave
The Algorithm
• A leaf sends in the token the value of its
remote counter
• An inner node sends up the sums of its
remote counter and those of its descendants
• R1 ≡ ∆ > 0
• R = R0 ∨ R1
Example (cont.)
[Diagram: Locali(x) = 1, Remotei(x) = 1; Localj(x) = 2, Remotej(x) = -1;
the token sums yield ∆ = 1 ⇒ R1 is true, so another wave will begin]
Example: R1 Falsification
[Diagram: node j forwards the pointer with W:=Y to node k, which has also
already sent its token]
Example: R1 Falsification
[Diagram: after W:=Y, Remotej(x) = 0 and Remotek(x) = -1 with
Localk(x) = 2; now ∆ = 0 ⇒ R1 is false even though k is dirty]
The Algorithm
• Detect if ∆ may have decreased due to a
node in S:
– Initially paint all nodes white
– A node that decreases remote(X) turns black
• R2 ≡ at least one node in S is black
• R = R0 ∨ R1 ∨ R2
Example: R2 Falsification
[Diagram: receiving the pointer decreases Remotek(x) to -1, so k turns
black; but k is outside S, while inside S Remotej(x) = 0 and
Remotei(x) = 1]
Example: R2 Falsification
[Diagram: k drops its references (Localk(x) = 0) and sends up its token;
no node in S is black ⇒ R2 is false]
The Algorithm
• Propagate the color information:
– A node that is black or has received a black
token transmits a black token
– Otherwise, transmits a white token
– A node that transmits a black token becomes
white
• R3 ≡ some node in S has a black token
• R = R0 ∨ R1 ∨ R2 ∨ R3
Example (cont.)
[Diagram: k's blackness travels with its token, propagating up through
the tree; a node in S now holds a black token, so R3 is true]
The Algorithm
• C = [S = {root}
∧ root is white and localroot(X) = 0
∧ all tokens at the root are white
∧ ∆fin = 0]
• Once the root has received tokens from all its
children and localroot(x) = 0, it checks C:
– C = true ⇒ object X is garbage
– Otherwise - the root becomes white and
initiates another wave
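A single wave over a static tree can be sketched as follows (a deliberate simplification with invented names: the tree and counters are frozen for the duration of the wave, so signals are implicit and C reduces to the checks below):

```python
# Hypothetical sketch of one RRC wave for object X: tokens come back up
# the spanning tree carrying the subtree's remote-counter sum and a color;
# the root then evaluates the termination condition C.

class Node:
    def __init__(self, name, local=0, remote=0, children=()):
        self.name, self.local, self.remote = name, local, remote
        self.children = list(children)
        self.black = False     # set when the node decreases Remote(X)

def wave(node):
    """Return (delta, color) for node's subtree once every node is clean:
    delta sums remote counters; color is black if any node in the subtree
    was black (a node whitens after transmitting its token)."""
    delta, black = node.remote, node.black
    for child in node.children:
        d, b = wave(child)
        delta += d
        black = black or b
    node.black = False         # transmitted its (possibly black) token
    return delta, black

# All nodes clean, no pointers in transit, nobody turned black:
leaves = [Node("j"), Node("k")]
root = Node("root", children=[Node("i", children=leaves)])
delta_fin, black = wave(root)

# C = root clean and white, all tokens white, and delta_fin == 0
C = root.local == 0 and not black and delta_fin == 0
print(C)   # True: X is garbage and can be reclaimed
```

In the real algorithm nodes keep mutating while the wave runs, which is exactly why R0-R3 and repeated waves are needed; this sketch only shows what the root checks at the end of one wave.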
Correctness Proof
• Layout:
– Show that R = (R0 ∨ R1 ∨ R2 ∨ R3) is invariant
– C = true ⇒ (R1 ∨ R2 ∨ R3) = false ⇒
R0 = true ⇒ object X is garbage
R0R1R2R3 is invariant
• Assume by negation R is false
• Look at the wave in which R first became false:
– R = false  R0 = false  some node outside S
was dirty
– i = the first node outside S to become dirty
• Case 1: R became false before i first became dirty
– Implies that some node became dirty before i impossible by definition of i
R0R1R2R3 is invariant
• Case 2: R became false after i first became dirty
– i received a message containing a pointer to X
after sending its token
– case 2.1: the message was sent in a previous wave
• More pointers sent than received   > 0 at the
beginning of the wave
– If  doesn’t decrease  R1 = true
– Otherwise: some node becomes black 
R2  R3 = true
R0R1R2R3 is invariant
– case 2.2: the message was sent in the current
wave
• The message could have been sent only by a node j
with local(X) > 0  inside S
• j increased  after sending the message:
– If  was < 0 before then some node turned black
before i became dirty  R2  R3 = true until the
end of the wave
– Otherwise  > 0 after j increased it  R1 = true
till the end of the wave or until some node
becomes black
Correctness Proof (cont.)
• If the root hasn’t received a black token, S = {root},
the root is white and ∆fin = 0, then there are no
messages in transit with pointers to object X
– No node became black during the wave ⇒ ∆ didn’t
decrease ⇒ no messages were sent during the wave by
nodes in S
– ∆ ≥ 0 at the beginning and ∆fin = 0 ⇒ ∆ = 0 for the
duration of the wave ⇒ no message in transit at the
beginning of the wave
– No node outside S can receive a message and become
dirty
Correctness Proof (cont.)
• If the root hasn’t received a black token, S = {root},
the root is white and ∆fin = 0, then R0 = true
– R2 ∨ R3 = false
– R1 = false
– R is invariant
• If the root hasn’t received a black token,
S = {root}, ∆fin = 0 and the root is white and
clean, then object X can be safely reclaimed
– R0 = true - all nodes outside S are clean
– The root is clean
– There are no pointers in transit
Liveness Proof
• RRC doesn’t reclaim cycles
• Unreferenced object - referenced neither from
the local memory of any node nor from any
message in transit
Liveness Proof
• If an object is unreferenced, it will eventually
be reclaimed by RRC
– For all nodes local(X) = 0
– All nodes will finally send a token
– If at the root C = false, another wave begins:
• No messages with pointers to X exist ⇒ no node
will turn black
• No pointers to X exist ⇒ none will be sent ⇒ ∆ = 0
• C = true at the end of the wave
Liveness Proof
• If a garbage object is not reachable from any
garbage cycle, it will eventually be reclaimed by RRC
[Diagram: garbage chain X3 → X2 → X1 → X is reclaimed head first - once
X3 is collected, X2 becomes unreferenced, and so on down to X]
Distributed Shared Memory
(DSM)
• Software providing an abstraction of shared
memory, running on networked
workstations
• Each workstation’s memory acts as a cache
• No explicit message passing by the application -
data is shared through virtual shared memory
Millipage DSM
• Implements MULTIVIEW
– Enables fine-grained sharing in page-based DSMs
– Eliminates false sharing
• Each object is mapped to a different virtual page,
called minipage
– One node is the manager of the minipage
• handles page faults - read/write requests
• invalidation of a minipage = discarding it from local
memory
– Current version implements sequential consistency
RRC Message Waves
• A global tree is built during initialization
• A wave begins when the local counter at the root
becomes 0
• Communication may be asynchronous - RRC
messages can be delayed and piggybacked on other
RRC or DSM messages
• Discard messages are sent only on memory reuse
Example
[Diagram sequence on Millipage, nodes i, j, k; page p1 at node i holds
objects X and Y]
1. j and k issue Read(X); the minipage holding X is sent out, leaving
Remotei(X) = 2, Locali(X) = 1, and Local(X) = 1, Remote(X) = -1 at each
reader
2. Y spreads the same way (Remotei(Y) = 2, Localj(Y) = Localk(Y) = 1,
Remotej(Y) = Remotek(Y) = -1), and i allocates a new object Z
3. An invalidation message for X discards stale copies of its minipage
4. Once every Local(Y) drops to 0, a wave for Y starts: signals travel
down the tree, then tokens return up
Performance Evaluation
• The system:
– 8 Pentium II 300 MHz workstations
– Windows NT Workstation 4.0 SP3
– 128 MB RAM
– interconnected by a switched Myrinet LAN
• Benchmarks:
– Allocate objects and don’t free them
– Executed a number of times in a non-stop manner
Benchmarks
• Water - a parallel application from the field of molecular
dynamics
• LU Decomposition - factors a dense matrix A into the
product of a lower triangular matrix L and an upper triangular
matrix U
• Integer Sort - sorts N integer values in parallel
• Successive Over-Relaxation - input: a two-dimensional grid.
In each iteration every grid element is updated to the average
of its four neighbors
• Traveling Salesman Problem - find the minimum-cost,
simple, cyclic tour in a weighted graph.
Application Suite

                               Water   LU      IS      SOR      TSP
No. of Runs                    3       30      10      3        5
Shared Memory                  1MB     240MB   160KB   24.6MB   4MB
No. of Objects                 1542    510     80      6150     25025
Garbage Creation Rate (obj/s)  27      57      2       5082     1321
Speedup on 8 Nodes             6.5     4.9     7.4     7.0      5.5
RRC Communication Cost
1-2 waves are enough to detect that an object is garbage
RRC Communication Cost
• Communication complexity is independent of
the number of pointer operations
– Simulation of different rates of pointer operations
showed no change in the number of GC messages
• Efficiency relies on 2 observations:
– Object use is usually localized in time
– The node that created the object is usually the last
to use it
RRC Communication Cost
• To improve performance:
– Tokens and signals can be combined or
piggybacked on other messages
– RRC can be turned off or delayed when best
performance is desired
Scalability
• Problem: GC waves span all the processes in
the system
• Overhead increases less than linearly; the increase is also due to:
– Increased garbage creation rate
– Increased number of page faults
– Increased number of “discard” messages
• GC overhead in a single node is independent
of the number of signals and tokens sent
Scalability
[Graphs: GC message count and overhead as the number of nodes grows]
Collection in Granularity
Larger than Objects
• Expected to decrease the number of GC
messages
• Tested on SOR:
– A single minipage contains several objects
instead of only one
Collection in Granularity of Pages
[Graph: GC messages on SOR with page-granularity collection]
Collection in Granularity of
Pages
• Advantages:
– Reduction in memory overhead
– Easier organization of the free list
– Cycles contained entirely within the page are collected
• Disadvantages:
– Delay in reclamation
• Not a significant problem according to the memory
locality principle
– Creation of false cycles
CPU Time - Root
[Graph: GC CPU time spent at the root node]
CPU Time - Inner Node
[Graph: GC CPU time spent at an inner node]
RRC - Conclusions
• A GC algorithm that works correctly in a reliable
asynchronous message-passing distributed system
• Successfully implemented as a WIN32 library on
Windows-NT on top of MILLIPAGE
• 2-3 messages to identify a garbage object,
independent of reference graph mutations
• Use of a reference counting technique ensures low
computational overhead
RRC - Conclusions (cont.)
• Scalable - the number of GC messages sent by a
single node is independent of the number of nodes
• Improvement in communication overhead with
increase in collection granularity
• Unable to collect cycles