REPLICATING FILES AND OTHER BIG OBJECTS “OUT OF BAND” WITH ISIS2
Cornell University
Ken Birman
Core Challenge
Many cloud computing systems work with very large files or other big objects
Frequently they take the form of massive byte arrays, and it isn’t at all uncommon to “map” them into memory
On Linux and Windows the memory-mapped file API makes this easy to do
It takes a file name and returns a pointer to a memory region where you can directly access the bytes of that file
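As a minimal sketch of what "mapping" looks like in C#, the following uses the standard .NET System.IO.MemoryMappedFiles API. The temporary file name and its four bytes are made up for the illustration; nothing here is specific to Isis2.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Write a small file so the sketch has something to map (hypothetical temp file).
string path = Path.Combine(Path.GetTempPath(), "oob-demo-" + Guid.NewGuid().ToString("N") + ".bin");
File.WriteAllBytes(path, new byte[] { 10, 20, 30, 40 });

byte third;
// Map the file by name; no read() calls and no explicit buffer are needed.
using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
using (MemoryMappedViewAccessor view = mmf.CreateViewAccessor())
{
    // Direct access to the byte at offset 2 of the file, through the mapping.
    third = view.ReadByte(2);
}

Console.WriteLine(third);   // prints 30
File.Delete(path);
```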
Not long ago, Isis2 wasn’t a good choice for applications with big objects…
We created the OOB layer because moving big objects inside Isis2 was simply too costly
You can put big things into messages, and Isis2 carves them into smaller chunks
But they can seriously disrupt steady flow in the system
The issue is that Isis2 needs to maintain FIFO ordering for lower level communication between group members
Hence a big object needs to be fully transferred before small things sent after it can be delivered, even if they were sent by some other thread for some other reason
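The cost of FIFO ordering can be estimated with a little arithmetic. The sketch below is a toy model, not Isis2 code: it assumes a single FIFO link with a made-up usable rate of 1 GB/s, a 2 GB object, and a 1 KB message queued right behind it.

```csharp
using System;
using System.Collections.Generic;

// Delivery time of each message on one FIFO link: a message is delivered
// only after everything queued ahead of it has been sent.
List<double> DeliveryTimes(IEnumerable<long> sizesInBytes, double bytesPerSecond)
{
    var times = new List<double>();
    double clock = 0;
    foreach (long size in sizesInBytes)
    {
        clock += size / bytesPerSecond;
        times.Add(clock);
    }
    return times;
}

const double LinkRate = 1e9;                 // assumed: 1 GB/s of usable bandwidth
long big = 2_000_000_000, small = 1_000;     // assumed sizes in bytes

List<double> behindBig = DeliveryTimes(new[] { big, small }, LinkRate);
List<double> alone = DeliveryTimes(new[] { small }, LinkRate);

Console.WriteLine($"small message alone: {alone[0]:E2} s");
Console.WriteLine($"small message queued behind the big object: {behindBig[1]:F3} s");
```

Under these assumptions the small message goes from about a microsecond to about two seconds, which is the disruption the OOB layer is meant to avoid.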
Out of Band (OOB) Concept
We added a way to move very big byte[] objects
“outside” of the normal Isis2 communication path
We start by assuming the objects are memory-mapped files
(they don’t have to exist at all on disk, but they do have file
names that look like the names of disk files)
You can either create these from a file, or create a big mapped memory region and put data in it
These mapped files can be shared easily within a single
computer and are ultra efficient because no copying
occurs. Much faster than ANY form of copying!
Out of Band (OOB) Concept
[Figure: before and after. Before: machine A has a big memory object; we want copies on B and C… but not on D.]
Keep in mind that the object is really big. And there may be many such transfers to do, all at the same time
Out of Band (OOB) Concept
So…
You’ve created a memory-mapped region
… and put data into it, somehow
… it might be huge (hundreds of megabytes? Or even gigabytes? No problem! But > 6 GB needs a 64-bit O/S)
Our goal: Use Isis2 to efficiently move these from computer to computer in a cloud computing data center or a cluster
Ideally: a single DMA transfer, or a super-efficient series of Ethernet multicasts
Out of Band (OOB) Concept
Machine A has a big memory object. We want copies on machines B and C… but not on D
(1) Tell Isis2 about X using OOBRegister
(2) Tell your application on B and C to fetch X
(3) OOBReReplicate tells Isis2 to modify the replication pattern; applications on B and C call OOBFetch
Steps
First you need to tell the Isis2 subsystem that the file exists. There are three cases.
1. Isis2 could be linked directly to your application code.
2. Isis2 could run in a server that you talk to via RPC, perhaps from native C++.
3. We also have a command-line program that can talk to our server for you, so you can access OOB by issuing commands if the server is running.
Isis2 wants to know the file name. In RPC mode the data lives in the mapped memory and isn’t copied to Isis2
Steps
So… you
1. Register the memory-mapped file
Now you can
1. Form a process group
2. Replicate data within/among the group members. We call this “rereplicate” because you can do it again and again, changing the replication pattern
3. On the receiving “side”, fetch a pointer into the memory-mapped file region (this will wait until the data arrives)
Why do we call it “out of band”?
Often you’ll mix Isis2 RPC and multicast with out of band data transfer
Register a file, and start transferring it
In parallel, tell some group member(s) about it, by name
In such cases
Isis2 carries out the OOB transfer as efficiently as it can
The OOBFetch operation in the receiver blocks until the bytes have been correctly received and are available
Other options
You can also register an upcall handler
The OOB layer will tell it each time an incoming OOB file has been fully transferred
And you can ask for the replication map
It tells you which group members have which files
The idea is to be able to rereplicate in a flash, in parallel for multiple files if desired, and as close as possible to the raw hardware speed of the network
OOB interface
Example: Creating a new mapped file
// (1) Creates a completely new memory-mapped object
MemoryMappedFile mmf = MemoryMappedFile.CreateNew(fname, CAPACITY);
// (2) An "Accessor" allows you to access the bytes in the object
MemoryMappedViewAccessor mva = mmf.CreateViewAccessor();
// (3) An example of byte-by-byte access
for (int n = 0; n < CAPACITY; n++)
{
    byte b = (byte)(n & 0xFF);
    mva.Write<byte>(n, ref b);
}
You can also open an existing mapped file, if some other
program on the same computer created it
Then call
g.OOBRegister(string fname, MemoryMappedFile mmf)
Now Isis2 knows about the file
Next we can call ReReplicate:
g.OOBReReplicate(fname, where);
Fname is the file name. But what goes in “where”?
The “where” argument to ReReplicate
This should be an object of type List<Address>.
For example, given a view v for a group,
List<Address> everywhere = v.members.ToList();
creates a list with every group member in it.
It must list ALL the places where you want replicas. Isis2
will create new replicas and also delete unwanted ones
Create new replicas before deleting old ones: two steps
OOBDelete(fname) is short for OOBReReplicate with an
empty replica location list.
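Because the list names every desired replica location, the work implied by one call is the difference between the current holders and the requested ones. The sketch below illustrates that bookkeeping only; it is not how Isis2 implements it. PlanReReplicate is a hypothetical helper, and plain strings stand in for Address objects.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Given the current replica holders and the requested "where" list, work out
// which replicas must be created and which must be deleted.
(List<string> toCreate, List<string> toDelete) PlanReReplicate(
    IEnumerable<string> current, IEnumerable<string> where)
{
    var have = new HashSet<string>(current);
    var want = new HashSet<string>(where);
    return (want.Except(have).OrderBy(m => m).ToList(),
            have.Except(want).OrderBy(m => m).ToList());
}

// Replicas currently on A and D; the new pattern asks for A, B and C.
var plan = PlanReReplicate(new[] { "A", "D" }, new[] { "A", "B", "C" });
Console.WriteLine("create: " + string.Join(",", plan.toCreate));  // create: B,C
Console.WriteLine("delete: " + string.Join(",", plan.toDelete));  // delete: D

// An empty "where" list removes every replica, matching the OOBDelete shorthand.
var wipe = PlanReReplicate(new[] { "A", "B", "C" }, Array.Empty<string>());
Console.WriteLine("delete: " + string.Join(",", wipe.toDelete));  // delete: A,B,C
```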
Now Isis2 knows about the file
ReReplicate also has a second overload:
g.OOBReReplicate(fname, where, (Action<string, MemoryMappedFile>)
    delegate(string oobfname, MemoryMappedFile m) {
        IsisSystem.WriteLine("ReReplicate finished for " + oobfname);
    });
The delegate method will be called by Isis2 when the transfer finishes. The transfer itself runs asynchronously, out of band!
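One way to consume a completion delegate like this is to adapt it to a Task, so the caller keeps working and waits only when it needs the result. In the sketch below StartTransfer is a hypothetical stand-in that merely imitates an asynchronous transfer; it is not the Isis2 call, and only the delegate type Action<string, MemoryMappedFile> is taken from the overload above.

```csharp
using System;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

// Hypothetical stand-in for an asynchronous transfer: it returns at once and
// invokes the completion delegate later, from another thread.
void StartTransfer(string fname, Action<string, MemoryMappedFile> done)
{
    Task.Run(() =>
    {
        // Anonymous mapping used only as a placeholder argument.
        using MemoryMappedFile m = MemoryMappedFile.CreateNew(null, 4096);
        done(fname, m);
    });
}

// Adapt the callback to a Task so the caller can overlap other work.
var finished = new TaskCompletionSource<string>();
StartTransfer("/tmp/xxx", (name, mmf) => finished.SetResult(name));

Console.WriteLine("transfer started; caller keeps working");
string completedName = finished.Task.Result;   // blocks until the callback has run
Console.WriteLine("ReReplicate finished for " + completedName);
```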
How to access your replica
You call
MemoryMappedFile xmmf = g.OOBFetch(fname);
This call will wait until the ReReplication action finishes (so it is a mistake to do it if you haven’t started one!). That could take a while if the file is big: a 5 GB file on a 10 Gb/s network needs at least 4 seconds to transfer, even at 100% of the line rate
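A rough lower bound on the wait can be worked out from the payload size and the link rate. The sketch assumes decimal gigabytes and counts raw wire time only; protocol overhead and startup cost push real transfers higher.

```csharp
using System;

// Lower bound on transfer time: payload bits divided by the link rate.
double MinTransferSeconds(double payloadBytes, double linkBitsPerSecond)
    => payloadBytes * 8.0 / linkBitsPerSecond;

double fiveGB = 5e9;      // 5 GB payload (decimal gigabytes, an assumption)
double tenGbps = 10e9;    // 10 Gb/s link

double seconds = MinTransferSeconds(fiveGB, tenGbps);
Console.WriteLine(seconds);   // prints 4
```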
How our server works
We built a very simple server that accepts RPC requests in Web Services style
Then we created a simple “thin” library to talk to it
You can pass a file name to it, and it will do an OOB operation using that file name as the argument
Remember: memory mapped files are accessible from any program on the same machine!
So Isis2 can access your memory mapped files even from this server, even if you aren’t “linked” to it!
The command-line API works the same way
Recap: A very fast way to move objects around
Machine A has a big memory object. We want copies on machines B and C
(1) Tell Isis2 about X using OOBRegister
(2) Tell your application on B and C to fetch X
(3) OOBReReplicate tells Isis2 to modify the replication pattern; applications on B and C call OOBFetch
How we use OOB inside Isis2
One situation where Isis2 has to copy identical data to lots of group members involves a master/worker startup with many new members joining
All the new members need the new group view!
… and because they don’t have the prior group view, we can’t just send the delta, which is how new view events normally work
So, if the group is large, Isis2 creates a memory-mapped object containing the view, then uses OOB to transfer it to the joining processes!
You might use it for state transfer too!
The initialization case is a form of state transfer
Suppose you are building a group but the state is
very large, like a file service
If you try and transfer the state “in band” it could
take ages and disrupt the group for a long time!
OOB to the rescue!
Better: pre-transfer as much state as you can using the OOB tool
You’ll need a way to contact the group before even trying to join. A good option: the Client API
It allows you to bind to a randomly chosen “representative”
Load balances these roles… Representative must “allow client requests” to handlers you can call as a client.
So, you create a state pre-fetch API for clients
Joining member shows up, perhaps authenticates itself, and you use OOB to pre-send all that state
But if updates are active…
… a race condition forms!
Suppose the state is A…. W but during the time between when you finish being a client and join, updates X and Y occur in the group
Your state is “stale” – should it be discarded?
We recommend:
Associate a counter or timestamp with the state. The version you pre-transferred had, perhaps, T=23
Now we can use this to “finalize” the state
Implementation
g.Join() has an overload where you can pass in a long integer. Pass this timestamp
The process that initiates your state transfer will find the timestamp value in the view, in a field called v.offset
It can compute a state for you that includes updates done subsequent to when you pre-transferred state!
OOB pre-transfer idea
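[Figure: timeline for group members P and Q and a joining process R. Labels, roughly in order: R, as Client of G, asks “Pre-transfer please?”; create mapped file from the memory-mapped byte[] representation; OOBReReplicate of “/tmp/xxx” @ T=123; reply “look in /tmp/xxx, T=12345”; OOBFetch(); OOBDelete; g.Join(12345); updates since T=12345.]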
Group obligation?
If state of the group is an append-style log, this
concept is easily implemented
Otherwise, group needs to keep a log of “recent”
updates and implement some form of periodic
snapshot in which the stored state has an associated
time (how many updates it reflects), and the log has
the remaining updates
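The bookkeeping this asks of the group can be sketched with a counter and an in-memory log. This is an illustration under simplifying assumptions (updates are strings, the log is never truncated); Apply and UpdatesSince are hypothetical names, not Isis2 API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Every update gets a sequence number; "T" is how many updates a state reflects.
var log = new List<(long seq, string update)>();
long applied = 0;

void Apply(string update) => log.Add((++applied, update));

// Updates a joiner still needs if its pre-transferred state reflects T updates.
List<string> UpdatesSince(long t) =>
    log.Where(e => e.seq > t).Select(e => e.update).ToList();

foreach (string u in new[] { "A", "B", "C" }) Apply(u);
long preTransferT = applied;          // snapshot shipped out of band at T=3
Apply("X");
Apply("Y");                           // updates that raced with the join

Console.WriteLine(string.Join(",", UpdatesSince(preTransferT)));  // X,Y
```

A joiner that passes its T when joining then needs only the tail of the log, not the whole state again.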
Serialization
We have several ways to create the byte[] representation of these view objects
Msg.ToBArray(objs…)
C# serialization
Your favorite way of generating a byte[] object
But keep in mind that because an mva isn’t a byte[] object, copying does occur at the last step of transforming data into a C# managed object
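The copy in that last step is visible in the standard .NET API: getting from a view accessor to a managed byte[] means reading the mapped bytes into a freshly allocated array. A small sketch, with a made-up 1 KB capacity and an anonymous mapping so that it runs on any platform:

```csharp
using System;
using System.IO.MemoryMappedFiles;

const int Capacity = 1024;   // hypothetical size for the sketch

// CreateNew(null, ...) makes an unnamed mapping that is not shared by name.
using MemoryMappedFile mmf = MemoryMappedFile.CreateNew(null, Capacity);
using MemoryMappedViewAccessor mva = mmf.CreateViewAccessor(0, Capacity);

mva.Write(0, (byte)42);

// Going from mapped memory to a managed byte[] copies the bytes.
byte[] managed = new byte[Capacity];
int copied = mva.ReadArray(0, managed, 0, managed.Length);

Console.WriteLine($"{copied} bytes copied, first byte = {managed[0]}");
```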
Performance considerations
In theory, the very best way to move the bytes is with Ethernet multicast or Infiniband
Isis2 supports both… but they behave differently
Ethernet multicast is highly efficient for 1:n, but the data still is copied from kernel to user address space
Infiniband multicast doesn’t work well, hence we use Infiniband “verbs” to send the data via multiple 1:1 streams. But these avoid any kernel/user copying
Worst performance: ISIS_UNICAST_ONLY case