
On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS
Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson (CMU) - Seung Woo Son, Samuel J. Lang, Robert B. Ross (ANL)
Overview
Internet Services
• Distributed file systems purpose-built for anticipated workloads
• Hadoop & the Hadoop Distributed File System (HDFS)
• Use triplication for reliability
• Use file layout to collocate computation and data
High performance computing (HPC) [e.g. PVFS]
• Equally large-scale applications
• Parallel file system
• Concurrent reads and writes
• Typically supports the POSIX and VFS interfaces
Experiment Setup
• OpenCloud cluster - 51 nodes (8-core 2.8 GHz, 16GB DRAM, 4 SATA disks with 1 used in experiments, 10 GE)
• Benchmarks
  • Data-set: 50 million 100-byte records (50GB)
  • Workloads: write, read, grep (for a rare pattern), sort
• Applications
  • Sampling (B. Fu): read a 71GB astronomy data-set
  • FoF (B. Fu): cluster & join astronomical objects
  • Twitter (B. Meeder): reformat 24GB of data into 56GB
PVFS plug-in under the Hadoop stack
[Figure: Java applications run on the Hadoop/MapReduce framework and reach storage through the file system extensions API (org.apache.hadoop.fs.FileSystem). libhdfs sends requests to HDFS servers, while a PVFS shim layer (buf, map, rep) over libpvfs sends requests to PVFS servers (MDS and data servers). Each server stores data in a local file system and is reached from the client over the network.]
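To make the plug-in idea concrete, the following is a minimal client-side sketch of how such a file system extension is wired in through the FileSystem API. The class name org.apache.hadoop.fs.pvfs.PVFSFileSystem, the pvfs:// scheme, and the host/port are placeholders assumed for illustration; the poster does not name them, and the shim internals (buf, map, rep) sit behind this interface.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PvfsWiringSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Map the "pvfs" URI scheme to a (hypothetical) plug-in class that
        // extends org.apache.hadoop.fs.FileSystem and calls libpvfs underneath.
        conf.set("fs.pvfs.impl", "org.apache.hadoop.fs.pvfs.PVFSFileSystem");

        // MapReduce code only sees the FileSystem API, so jobs run unchanged
        // whether the scheme resolves to HDFS or to the PVFS shim.
        FileSystem fs = FileSystem.get(URI.create("pvfs://mds-host:3334/"), conf);

        try (FSDataInputStream in = fs.open(new Path("/data/records"))) {
            byte[] buf = new byte[4096];
            int n = in.read(buf);   // served through the shim, not libhdfs
            System.out.println("read " + n + " bytes through the pvfs:// scheme");
        }
    }
}

Because every access goes through org.apache.hadoop.fs.FileSystem, swapping the storage back end is a configuration change rather than an application change.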
HDFS/PVFS data layout schemes:
• HDFS Random: 1 copy on the writer's disks, 2 copies on random servers
• PVFS Round-robin: all 3 copies striped round-robin in the file
• PVFS Hybrid: 1 copy on the writer's disks, 2 copies striped
[Figure: block placement under the HDFS Random, PVFS Round-robin, and PVFS Hybrid layouts, showing 64MB blocks at file offsets 0M to 256M across four servers, one of which is the writer.]
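The three placement policies can be written down as a small toy model. This is an illustrative sketch only, not the actual HDFS or PVFS placement code: real HDFS replica selection is rack-aware, and the exact stripe-to-server mapping is not given on the poster, so the formulas below are assumptions.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Toy model of the three block placement schemes (an illustration, not the real code). */
public class LayoutSketch {
    static final long BLOCK_SIZE = 64L << 20;   // 64MB blocks, as in the layout figure

    /** HDFS Random: copy 1 on the writer's node, copies 2-3 on distinct random nodes. */
    static List<Integer> hdfsRandom(int writer, int numServers, Random rnd) {
        List<Integer> others = new ArrayList<>();
        for (int s = 0; s < numServers; s++) if (s != writer) others.add(s);
        Collections.shuffle(others, rnd);
        return Arrays.asList(writer, others.get(0), others.get(1));
    }

    /** PVFS Round-robin: all three copies striped round-robin over the servers. */
    static List<Integer> roundRobin(long offset, int numServers) {
        int blk = (int) (offset / BLOCK_SIZE);
        return Arrays.asList(blk % numServers, (blk + 1) % numServers, (blk + 2) % numServers);
    }

    /** PVFS Hybrid: copy 1 on the writer's node, copies 2-3 striped over the other nodes. */
    static List<Integer> hybrid(long offset, int writer, int numServers) {
        int blk = (int) (offset / BLOCK_SIZE);
        int[] others = new int[numServers - 1];
        for (int s = 0, i = 0; s < numServers; s++) {
            if (s != writer) others[i++] = s;
        }
        int c2 = others[(2 * blk) % others.length];
        int c3 = others[(2 * blk + 1) % others.length];
        return Arrays.asList(writer, c2, c3);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int servers = 4, writer = 0;
        for (long off = 0; off < 4 * BLOCK_SIZE; off += BLOCK_SIZE) {
            System.out.printf("offset %4dM  random=%s  round-robin=%s  hybrid=%s%n",
                    off >> 20,
                    hdfsRandom(writer, servers, rnd),
                    roundRobin(off, servers),
                    hybrid(off, writer, servers));
        }
    }
}

Running it for four servers with the writer on server 0 shows the intended contrast: Random and Hybrid always keep one copy on the writer, while Round-robin spreads every copy across the stripe.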
PVFS Shim responsibilities:
• Readahead buffer (buf): reads from PVFS in 4MB requests
• File layout (map): file layout exposed as extended attributes
• Replication (rep): triplicates data in one PVFS file
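The readahead buffer is the simplest of the three to picture: Hadoop issues many small reads, and the shim turns them into large 4MB requests to PVFS. The sketch below illustrates that buffering pattern in isolation; it is not the shim's actual code, which sits between the FileSystem API and libpvfs rather than wrapping a generic InputStream.

import java.io.IOException;
import java.io.InputStream;

/** Illustrative readahead wrapper: fetch in 4MB requests, serve small reads from the buffer. */
public class ReadaheadInputStream extends InputStream {
    private static final int READAHEAD = 4 << 20;   // 4MB, as in the shim description

    private final InputStream in;       // e.g. a stream ultimately backed by libpvfs
    private final byte[] buffer = new byte[READAHEAD];
    private int filled = 0;             // number of valid bytes in the buffer
    private int pos = 0;                // next byte to hand to the caller

    public ReadaheadInputStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        if (pos >= filled) {
            // One large request to the file system instead of many small ones.
            filled = in.read(buffer, 0, READAHEAD);
            pos = 0;
            if (filled < 0) return -1;   // end of file
        }
        return buffer[pos++] & 0xff;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}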
Experiment Results
N clients write N 1GB files (TmpFS)
[Figure: aggregate write throughput (MB/s) vs. number of clients (5-50); series: HDFS Random, PVFS Round-Robin, PVFS Hybrid, PVFS Hybrid (4 streams).]

N clients write N 1GB files (Disk)
[Figure: aggregate write throughput (MB/s) vs. number of clients (5-50); series: HDFS Random, PVFS Round-Robin, PVFS Hybrid.]
• HDFS is surprisingly tied to disk performance, even without any explicit sync

Grep Benchmark
[Figure: completion time (seconds) for vanilla PVFS, PVFS with a readahead buffer, PVFS with a readahead buffer and layout, and HDFS.]
• By using both the readahead buffer and file layout information, PVFS performance is comparable to HDFS

Benchmarks and applications
[Figure: completion time (seconds) for the write, read, grep, and sort benchmarks and for the sampling, fof, and twitter applications under HDFS Random, PVFS Round-Robin, and PVFS Hybrid.]
• PVFS performance is comparable to HDFS for both Hadoop benchmarks and scientific applications
Acknowledgements: Robert Chansler, Tsz Wo Sze, Nathan Roberts, Bin Fu, and Brendan Meeder
Conclusion
• With a few modifications in a non-intrusive shim layer, PVFS matches HDFS performance for Hadoop applications
• File layout information is essential for Hadoop to collocate computation and data, as sketched below
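A short sketch of the API behind that collocation: Hadoop asks the file system for block locations and tries to schedule each map task on a host that holds the block. For the PVFS shim, these locations would be derived from the layout exposed as extended attributes; the URI and file path below are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Prints where each block of a file lives, as seen through the FileSystem API. */
public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        Path path = new Path(args[0]);                       // e.g. pvfs://... or hdfs://...
        FileSystem fs = path.getFileSystem(new Configuration());

        FileStatus status = fs.getFileStatus(path);
        // Hadoop builds input splits from these locations and tries to run
        // each map task on one of the hosts that stores the corresponding block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d len=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
    }
}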