Mayflower: Improving Distributed Filesystem Performance Through SDN/Filesystem Co-Design

Authors: Sajjad Rizvi, Xi Li, Bernard Wong, Fiodar Kazhamiaka, Benjamin Cassell
Presented by: Yihan Li
Outline

- Motivation
- Introduction
- Design Overview
- Replica and Path Selection
- Evaluation
- Conclusion
Motivation

- The network is the performance bottleneck
  - Distributed filesystems are the primary bandwidth consumers
  - Network architectures are oversubscribed
  - High-performance SSDs shift the bottleneck from storage to the network
- Current distributed filesystems and network control planes are designed independently
  - They are not reciprocally involved in making network decisions
  - Replica selection uses only static network information, such as network distance
  - Static information does not capture dynamic resource contention or network congestion
What is Mayflower (I)

- Mayflower is co-designed from the ground up with a Software-Defined Networking (SDN) control plane
- It consists of three main components:
  - Dataserver
  - Nameserver
  - Flowserver
- Beyond its own read requests, it can perform path selection for other applications through a public interface
What is Mayflower (II)

- Dataservers
  - Perform reads from and appends to file chunks
- Nameserver
  - Manages the file-to-chunk mapping
- Flowserver
  - Runs alongside the SDN controller
  - Models the path bandwidth of the elephant flows
  - Performs both replica and network path selection
Advantages

- Filesystem and network decisions are made collaboratively by the filesystem and the network control plane
- Mayflower evaluates all possible paths between the client and all of the replica hosts
- It can directly minimize the average request completion time, accounting for:
  - The expected completion time of the pending request
  - The expected increase in completion time of other in-flight requests
- It can determine whether to read concurrently from multiple replica hosts
Design Overview (I)

- Five assumptions:
  1. The system stores only a modest number of files
  2. Most reads are large and sequential, and clients often fetch entire files
  3. File writes are primarily large sequential appends (random writes are very rare)
  4. The workloads are heavily read-dominant
  5. The network is the bottleneck
Design Overview (II)

- Mayflower selects both the replica and the network path for its read operations
  - It estimates the current network state and makes selections accordingly
  - It can work together with existing network managers
- It periodically fetches flow statistics from the edge switches (to correct estimation errors)
- It re-computes an estimate of the path bandwidth (to keep completion-time estimates accurate)
File Read Operation
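The read operation on this slide can be sketched as the interplay of the three components: the nameserver resolves a file to chunks and replica hosts, the flowserver picks a replica (and installs a network path), and the client then reads from the chosen dataserver. The sketch below is illustrative only; all class and method names (`lookup`, `select_replica`, `read_file`) are hypothetical, not Mayflower's actual API, and replica choice is reduced to "highest estimated bandwidth" for brevity.

```python
class Nameserver:
    """Maps each file to its list of (chunk_id, replica_hosts)."""
    def __init__(self, chunk_map):
        self.chunk_map = chunk_map

    def lookup(self, filename):
        return self.chunk_map[filename]


class Flowserver:
    """Holds per-(client, host) bandwidth estimates and picks a replica."""
    def __init__(self, est_bandwidth):
        self.est_bandwidth = est_bandwidth

    def select_replica(self, client, replicas):
        # Simplified stand-in for replica-path selection: pick the replica
        # whose path to the client has the most estimated bandwidth.
        return max(replicas, key=lambda h: self.est_bandwidth.get((client, h), 0.0))


def read_file(client, filename, nameserver, flowserver, dataservers):
    """Read every chunk of a file from the replica the flowserver chooses."""
    data = b""
    for chunk_id, replicas in nameserver.lookup(filename):
        host = flowserver.select_replica(client, replicas)
        data += dataservers[host][chunk_id]
    return data
```

For example, with two replicas of one chunk and a faster path to the second host, the read is served from that host.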
Design Overview (III)

- Mayflower provides sequential consistency by default
- Mayflower provides linearizability with respect to read and append requests
  - By sending the last chunk's read requests to the primary replica host
  - The vast majority of chunks can be serviced by any replica host, since most chunks are essentially immutable
- The system delays deletes for time T (the maximum expiration period) to preserve consistency
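The delayed-delete rule above can be sketched as: a delete only takes effect T seconds after it is issued, so clients holding chunk mappings no older than T never observe a chunk vanishing underneath them. The class and method names below are hypothetical, for illustration only.

```python
import time

class DelayedDeleter:
    """Queue delete requests and make them effective only after T seconds
    (an illustrative sketch of delaying deletes for consistency)."""

    def __init__(self, t_seconds):
        self.t = t_seconds
        self.pending = []  # list of (issue_time, chunk_id)

    def delete(self, chunk_id, now=None):
        # Record the delete; the chunk stays readable for another T seconds.
        self.pending.append((now if now is not None else time.time(), chunk_id))

    def reclaimable(self, now=None):
        # Chunks whose delete was issued at least T seconds ago can be reclaimed.
        now = now if now is not None else time.time()
        return [c for issued, c in self.pending if now - issued >= self.t]
```

A delete issued at time 0 with T = 10 is not reclaimable at time 5 but is at time 10.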
Replica-path Selection Algorithm

- Based on the estimated network state
  - Bandwidth estimations
  - Remaining flow size approximations
- Target performance metric
  - Average job completion time
- Must account for the effect on existing flows
  - New flows affect the path selection for already scheduled flows
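The selection criterion can be sketched as scoring each candidate (replica, path) pair by the new flow's estimated completion time plus the slowdown it induces on flows already using that path, then picking the minimum. The model below (equal bandwidth sharing on a single bottleneck) is a deliberate simplification for intuition, not the paper's exact algorithm.

```python
def path_score(new_size, capacity, existing_sizes):
    """Estimated completion time of the new flow plus the extra time it adds
    to the flows already on the path, under an equal-share bandwidth model.

    existing_sizes: remaining un-transferred sizes of flows on this path."""
    n_after = len(existing_sizes) + 1
    share_after = capacity / n_after          # per-flow share once we join
    new_completion = new_size / share_after
    if existing_sizes:
        share_before = capacity / len(existing_sizes)
        # Each existing flow now finishes later; sum the added time.
        impact = sum(r / share_after - r / share_before for r in existing_sizes)
    else:
        impact = 0.0
    return new_completion + impact

def select_replica_path(candidates, new_size):
    """candidates: list of (replica, capacity, existing_sizes) tuples."""
    return min(candidates, key=lambda c: path_score(new_size, c[1], c[2]))
```

With equal capacities, a congested path loses to an idle one even though the replica data is identical, which is exactly the behavior the slide describes.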
Problem Statement (I)

- Optimization goal
  - Select the network path that minimizes the completion time of both the new flow and the existing flows
- The algorithm considers:
  1. The paths of existing flows
  2. The capacity of each link
  3. The data size of each request
  4. The estimated bandwidth shares of existing flows
  5. The remaining un-transferred data size of existing flows
Problem Statement (II)

- G: graph of paths from source to destination
- c_{i,j}: cost of the impact on existing flows
- b_{i,j}: bottleneck bandwidth
- d_{i,j}: data flow on edge (i, j)
- I_{i,j}: binary indicator (whether edge (i, j) is selected)
- S: super source
- t: sink node
- x: data size
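Read as a path-selection problem over G with super source S and sink t, the symbol list suggests an objective along the following lines. This is a hedged reconstruction for intuition only, not necessarily the paper's exact formulation:

```latex
\min \sum_{(i,j) \in G} I_{i,j}\left(\frac{x}{b_{i,j}} + c_{i,j}\right),
\qquad I_{i,j} \in \{0, 1\},
```

subject to the edges with I_{i,j} = 1 forming a single path from S to t. The first term is the new flow's transfer time (data size x over the edge's bottleneck bandwidth b_{i,j}); the second term is the cost c_{i,j} of the impact on existing flows, matching the two portions described on the next slides.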
Replica-Path Selection Process (I)
Replica-Path Selection Process (II)

- The first portion estimates the cost of the new flow
- The second portion estimates the impact of the new flow on the existing flows F_p in path p
- Bandwidth shares of the existing flows are computed using max-min fair-share calculations
- When a flow's size is unknown, an estimated size (the average elephant-flow size) is used
- Slack in updating bandwidth utilization:
  - The bandwidth utilization for the new flow is set to its estimated bandwidth share
  - Existing flows are updated with their new estimated values
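The max-min fair-share calculation mentioned above is the classic water-filling computation: flows demanding less than an equal split keep their demand, and the freed-up capacity is re-split among the rest. The function below is a generic textbook version of this computation, not Mayflower's code.

```python
def max_min_fair_shares(capacity, demands):
    """Max-min fair bandwidth shares on one bottleneck link (water-filling).

    capacity: link capacity; demands: per-flow demanded bandwidths.
    Returns the fair share allocated to each flow, in input order."""
    shares = [0.0] * len(demands)
    # Satisfy the smallest demands first.
    remaining = sorted(range(len(demands)), key=lambda i: demands[i])
    cap = capacity
    while remaining:
        equal = cap / len(remaining)
        i = remaining[0]
        if demands[i] <= equal:
            # This flow is satisfied; redistribute what it leaves behind.
            shares[i] = demands[i]
            cap -= demands[i]
            remaining.pop(0)
        else:
            # All remaining flows are bottlenecked: split capacity equally.
            for j in remaining:
                shares[j] = equal
            break
    return shares
```

For example, three flows demanding 2, 4, and 8 on a 10-unit link receive 2, 4, and 4: the small flows are satisfied, and the largest gets whatever remains.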
Replica-Path Selection Process (III)
Replica-Path Selection Process (IV)

- Reading from multiple replicas can further reduce the completion time
- The total cost is computed from the size and bandwidth of each sub-flow
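One natural way to size the sub-flows when reading from multiple replicas is to make them proportional to each path's bandwidth, so that all sub-flows finish at the same time: size_i = total * b_i / sum(b). The sketch below illustrates that proportional split under this assumption; it is not presented as Mayflower's exact rule.

```python
def split_read(total_size, bandwidths):
    """Split one read across replicas proportionally to path bandwidth.

    Sizing each sub-flow as total * b_i / sum(b) makes every sub-flow's
    completion time equal to total / sum(b), so no path sits idle."""
    total_bw = sum(bandwidths)
    sizes = [total_size * b / total_bw for b in bandwidths]
    completion = total_size / total_bw  # identical for every sub-flow
    return sizes, completion
```

A 100 MB read over paths with 10 and 40 units of bandwidth splits into 20 MB and 80 MB sub-flows, both finishing in 2 time units, versus 10 with the faster path alone.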
Evaluation (I)

- Experimental setup
  - 13 machines, each with:
    - 64 GB RAM
    - A 200 GB Intel S3700 SSD
    - Two Intel Xeon E5-2620 processors
  - Connected to a Mellanox SX6012 switch via 10 Gbps links
  - 64 virtual hosts across four pods (each pod has three physical machines)
Traffic Matrix

- Job arrivals follow a Poisson distribution
- File read popularity follows a Zipf distribution with skewness parameter 1.1
- R: probability that a client is placed in the same rack as the primary replica
- P: in another rack but in the same pod
- O = 1 - R - P: in a different pod

(figure: topology with Pods 1-4 and Racks 1-4)
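The workload on this slide, Poisson job arrivals plus Zipf(1.1) file popularity, can be generated with a few lines of standard-library Python. The function and parameter names below are illustrative, not taken from the paper's harness.

```python
import random

def generate_workload(n_jobs, n_files, rate, skew=1.1, seed=42):
    """Generate (arrival_time, file_id) pairs: Poisson arrivals at `rate`
    jobs/sec (exponential inter-arrival times) and Zipf-distributed file
    popularity with skewness `skew`. Illustrative sketch only."""
    rng = random.Random(seed)
    # Zipf weights: file k (1-indexed by rank) has weight 1 / k^skew.
    weights = [1.0 / (k ** skew) for k in range(1, n_files + 1)]
    t = 0.0
    jobs = []
    for _ in range(n_jobs):
        t += rng.expovariate(rate)  # next Poisson-process arrival
        f = rng.choices(range(n_files), weights=weights)[0]
        jobs.append((t, f))
    return jobs
```

With skew 1.1 the most popular file (rank 1) dominates the request mix, which is what makes replica and path choice matter for the hot replicas.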
Evaluation (II)

- Paths to all replicas are partially congested
Evaluation (III)
Evaluation (IV)

- Mayflower is effective at avoiding congestion points
Evaluation (V)
Evaluation (VI)

- Read requests together with a background flow
- Shows the interdependence between the network and the applications
Evaluation (VII)
Evaluation (VIII)
Conclusions

- How Mayflower improves read performance
  - A distributed filesystem that follows a network/filesystem co-design approach
  - A novel replica and network path selection algorithm
- Evaluation