Scaling Spark on Lustre

Nicholas Chaimov (University of Oregon)
Costin Iancu, Khaled Ibrahim, Shane Canon (LBNL)
Allen D. Malony (University of Oregon)
[email protected]
[email protected]
[email protected]
Workshop on Performance and Scalability of Storage Systems (WOPSSS 2016)
My Background
• Main area of research: HPC parallel performance measurement and analysis
  - TAU Performance System
• Research interests
  - Performance observation (application + system)
  - Introspection
    ◆ Runtime performance observation and query
  - In situ analysis
  - Feedback and adaptive control
• Performance optimization across the whole system
  - Based on behavior knowledge and performance models
• This work supported by an IPCC grant to LBL for data analytics (Spark) on Lustre
  - Nick's summer internship and Ph.D. graduate support
Data Analytics
• Apache Spark is a popular data analytics framework
  - High-level constructs for expressing analytics computations
  - Fast and general engine for large-scale data processing
  - Supports datasets larger than the system physical memory
• Improves programmer productivity through
  - HLL front-ends (Scala, R, SQL)
  - Multiple domain-specific libraries:
    ◆ Streaming, SparkSQL, SparkR, GraphX, Splash, MLLib, Velox
• Specialized runtime provides for
  - Elastic parallelism and resilience
• Developed for cloud and commodity environments
  - Latency-optimized local disk storage and bandwidth-optimized network
• Can Spark run well on HPC platforms?
Preview of Main Contributions
• Improved scaling of Spark on Lustre 520x on HPC
  - From 100 cores to 52,000 cores on Cray
• Deliver scalable data-intensive processing systems
  - Competitive with node-level local storage (SSD)
Berkeley Data Analytics Stack (Spark)
[Figure: Berkeley Data Analytics Stack, annotated with the workloads used in this work: PageRank (BigData Benchmark) and Collaborative Filtering (Spark-Perf)]
From https://amplab.cs.berkeley.edu/software/
HPC Node Performance on Spark
• Cray node ~2x slower than workstation for Spark
  - When data is on disk
  - Same concurrency: 86% slower on Edison
  - All cores (24): 40% slower on Edison
• The problem is I/O!
• Cray matches performance when data is cached

Spark SQL Big Data Benchmark dataset:
  S3 suffix: /5nodes/
  Scale factor: 5
  Rankings: 90 million rows, 6.38 GB
  UserVisits: 775 million rows, 126.8 GB
  Documents: 136.9 GB
Spark and HPC Design Assumptions
Why is I/O a problem on the HPC node?
• Spark expects
  - Local disk w/ HDFS overlay for a distributed file system
  - Can utilize fast local disk (SSD) for shuffle files
  - Assumes ALL disk operations are fast
• Spark generally targets cloud / commodity clusters
  - Disk: I/O optimized for latency
  - Network: optimized for bandwidth
  - Matches well to Spark's expectations
• HPC systems conflict with these assumptions
  - Disk: I/O optimized for bandwidth
  - Network: optimized for latency
Getting a Handle on Design Assumptions
• Can HPC architectures give performance advantages?
  - Do we need local disks?
    ◆ Cloud: node-local SSD
    ◆ Burst Buffer: mid layer of SSD storage
    ◆ Lustre: backend storage system
  - Can we exploit the advantages of HPC networks?
    ◆ What is the impact of RDMA optimizations?
• Differences in architecture guide the software design
  - Evaluation of Spark on HPC systems (Cray XC30, XC40)
  - Techniques to improve performance on HPC architectures by eliminating disk I/O overhead
Data Movement in Spark
• Block is the unit of movement and execution
  - Vertical movement: blocks moved between memory and persistent storage on a node
  - Horizontal movement: blocks transferred across the network to another node
  - Shuffle involves both vertical and horizontal movement, through temporary storage
[Figure: two nodes, each with cores running tasks, a Block Manager managing memory and persistent storage, and a Shuffle Manager managing temporary storage; the nodes communicate over the interconnection network]
I/O Happens Everywhere
• Program input/output
  - Explicit
  - Distributed, with a global namespace (HDFS)
• Shuffle and Block manager
  - Implicit
  - Local (Java)
    ◆ FileInputStream
    ◆ FileOutputStream
• Lots of I/O
  - Both disk and network
[Figure: example RDD DAG with Stages 1-3 (groupBy, map, join, union); each stage reads its input through the BlockManager, shuffles intermediate data, and writes output through the BlockManager]
Spark Data Management Abstraction
• Resilient Distributed Dataset (RDD)
  - Composed of partitions of data
    ◆ which are composed of blocks
  - RDDs are created from other RDDs by applying transformations or actions
  - Each RDD has a lineage specifying how its blocks are computed
  - Requesting a block either retrieves it from cache or triggers computation
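
A minimal sketch (mine, not from the slides) of how lineage and the cache interact, assuming a SparkContext sc and an input file are available:

  // Lineage is built lazily; persist() asks the BlockManager to keep blocks.
  val base   = sc.textFile("input.txt")      // lineage: read blocks from storage
  val parsed = base.map(_.toLowerCase)       // lineage: base -> parsed (no work yet)
  parsed.persist()                           // mark parsed's blocks for caching

  parsed.count()  // first action: blocks computed from the lineage, then cached
  parsed.count()  // second action: blocks served from the BlockManager cache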
Word Count Example

  val textFile = sc.textFile("input.txt")
  val counts = textFile.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
  counts.collect()

• Transformations declare intent
  - Do not trigger computation but simply build the lineage
  - textFile, flatMap, map, reduceByKey
• Actions trigger computation on the parent RDD
  - collect
• Data is transparently managed by the runtime
[Figure: lineage chain textFile -> flatMap -> map -> reduceByKey]
Partitioning
[Figure: the word-count lineage (textFile -> flatMap -> map -> reduceByKey) instantiated per partition across Nodes 1-6]
Data is partitioned by the runtime
Stages
[Figure: JOB 0 split into STAGE 0 (textFile, flatMap, map, reduceByKey local) and STAGE 1 (reduceByKey global); partitions p1-p3 flow through each stage and are exchanged between stages]
• Vertical data movement (local) between transformations
• Horizontal (remote) and vertical data movement between stages (shuffle)
Structure of Spark Input
• Data in Spark formats is pre-partitioned
• Parquet format: a single directory containing
    _common_metadata         ._common_metadata.crc
    _metadata                ._metadata.crc
    part-r-00001.gz.parquet  .part-r-00001.gz.parquet.crc
    part-r-00002.gz.parquet  .part-r-00002.gz.parquet.crc
    […]
    part-r-03977.gz.parquet  .part-r-03977.gz.parquet.crc
    _SUCCESS                 ._SUCCESS.crc
• This example has 3,977 partitions
    ◆ 3,977 data files
    ◆ 3,977 checksum files
    ◆ 3 metadata files and 3 metadata checksum files
Reason for Multiple File Opens
• Even with greatly increased time spent blocking on open, individual opens are still short
• However, there are a large number of opens
  - At minimum, each task opens a file, reads its part, and closes the file
    ◆ For each partition of the input, at least one file open
  - Many readers do multiple opens per partition
    ◆ sc.textFile => 2 opens per partition
  - Parquet reader: each task
    ◆ Opens the input file, opens the checksum file, compares checksums
    ◆ Closes the input file and checksum file
    ◆ Opens the input file, reads the footer, closes the input file
    ◆ Opens the input file, reads the actual data, closes the input file
  - Total of 4 file open/close cycles per task
  - 3,977 × 4 = 15,908 file opens to read the Big Data Benchmark dataset
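
As a minimal sketch (mine; the file name is hypothetical, sqlContext.read.parquet is the standard Spark 1.x API), this is all it takes at the application level to trigger those opens:

  // One logical load of the pre-partitioned Parquet directory above...
  val rankings = sqlContext.read.parquet("rankings.parquet")
  // ...but at execution time each of the 3,977 partitions costs roughly
  // four open/close cycles on the metadata server.
  rankings.count()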
Shuffle (Communication)
• Output (Map) phase
  - Every node independently
    ◆ sorts local data
    ◆ writes sorted data to disk
• Input (Reduce) phase
  - Every node
    ◆ reads local blocks from disk
    ◆ issues requests for remote blocks
    ◆ services incoming requests for blocks
[Figure: JOB 0 with STAGE 0 (map side) and STAGE 1 (reduce side); partitions p1-p3 are written by the map tasks and fetched by the reduce tasks]
Shuffle Directory Structure
[Figure: the same two-node diagram as before: tasks on cores, a Block Manager over memory and persistent storage, a Shuffle Manager over temporary storage, connected by the interconnection network]
• Each node stores its maps in a worker-specific directory in shuffle temporary storage
  - Storage is in subdirectories
    ◆ 15/shuffle_0_1_0.data, 36/shuffle_0_2_0.data, 29/shuffle_0_3_0.data, …
  - As many shuffle files as there are shuffle tasks
    ◆ # of shuffle tasks is configurable (# of block managers is configurable)
    ◆ Default: max # of partitions of any parent RDD
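
A small sketch (mine, not Spark's source) of how such a path might be composed, interpreting the file name fields as shuffle id, map id, and reduce id to match the examples above; the choice of numbered subdirectory is illustrative:

  // e.g. shufflePath("/tmp/spark-local", 0, 1, 0) -> ".../<subdir>/shuffle_0_1_0.data"
  def shufflePath(localDir: String, shuffleId: Int, mapId: Int, reduceId: Int): String = {
    val name   = s"shuffle_${shuffleId}_${mapId}_${reduceId}.data"
    val subDir = math.abs(name.hashCode) % 64   // spread files across subdirectories
    s"$localDir/$subDir/$name"
  }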
Shuffle Writes
• Each shuffle file is written to as many times as there are partitions in the input
  - This is controllable
  - Need partitions > cores for load balancing / latency hiding
• For each write
  - Open the file in append mode
  - Write the results of sorting the partition
  - Close the file
• Size of a write is the size of the partition after the local reduce
  - Varies with workload
  - Partitions are constrained to fit in memory
  - Often very small (in practice will never exceed 1 GB)
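
A minimal sketch (mine, not Spark's shuffle writer) of the per-write cycle just described; when the shuffle directory lives on Lustre, each open and close is a metadata-server operation:

  import java.io.FileOutputStream

  // Append one sorted partition's worth of bytes to a shuffle file.
  def appendPartition(shuffleFile: String, sortedBytes: Array[Byte]): Unit = {
    val out = new FileOutputStream(shuffleFile, true)  // open in append mode (MDS RPC)
    try out.write(sortedBytes)                         // write (OSS traffic)
    finally out.close()                                // close (MDS RPC)
  }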
Ok, so what about Spark on HPC Systems?
• Spark certainly can run on HPC platforms
• Consider Spark on Cray machines
  - Cray XC30 (Edison) and XC40 (Cori)
  - Lustre parallel distributed file system
• Out of the box, there are performance issues
• The good news is that it's all about Lustre!
• Perform experiments comparing Lustre and Burst Buffer using Spark benchmarks
Lustre Design
[Figure: Lustre design; Spark running on one node vs. multiple nodes, with all file I/O crossing the network to the shared Lustre servers]
Experimental Setup
• Cray XC30 at NERSC (Edison)
  - 2.4 GHz Ivy Bridge
  - 64 GB RAM
• Cray XC40 at NERSC (Cori)
  - 2.3 GHz Haswell
  - 128 GB RAM
  - Burst Buffer (Cray DataWarp)
• Comet at SDSC
  - 2.5 GHz Haswell
  - InfiniBand FDR
  - 320 GB SSD
  - 128 GB RAM regular nodes
  - 1.5 TB RAM large memory nodes
• Spark 1.5.0
  - spark-perf benchmarks (Core + MLLib)
I/O Scalability (Lustre and BB, Cori)
[Figure: GroupByTest I/O components on Cori; time per operation (microseconds) for Open, Read, and Write on Lustre, Burst Buffer private, and Burst Buffer striped, at 1-16 nodes. The number of open operations grows from 9,216 at 1 node to 36,864, 147,456, 589,824, and 2,359,296 at 2, 4, 8, and 16 nodes.]
• Spark results in lots of opens
  - Open/Close operations scale as O(cores²)
• Single MDS node is a bottleneck
• Read/Write scales better than Open (multiple OSS nodes)
(BB: Burst Buffer)
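
A short worked check (my reasoning, consistent with the shuffle-write pattern described earlier) of why opens grow as O(cores²):

% Each of the R shuffle files receives one append (open/close) per input
% partition, so with R and the number of input partitions M both
% proportional to the core count c:
\[
  \text{opens} \;\approx\; M \cdot R \;\propto\; c^{2},
  \qquad
  \frac{2{,}359{,}296}{9{,}216} \;=\; 256 \;=\; \Bigl(\tfrac{16\ \text{nodes}}{1\ \text{node}}\Bigr)^{2}.
\]
% The measured counts quadruple each time the node count doubles
% (9,216 -> 36,864 -> 147,456 -> 589,824 -> 2,359,296), matching the quadratic model.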
I/O Variability is HIGH with Extreme Outliers
[Figure: distributions of fopen and read times on Lustre vs. local SSD]
• The number of fopens quickly overwhelms the MDS
• Variability in fopen access time is the real problem
  - Mean fopen time is 23x larger than SSD; variability is 14,000x larger
• Extreme outliers result in straggler tasks (the longest operation sets a bound!)
Many Opens versus One Open
[Figure: slowdown of n × (open-read-close) cycles vs. open-once-read-n-times, as a function of read size (1 KB to 512 KB), for Edison Lustre, Cori Lustre, Cori Striped BB, Cori Private BB, Cori Mounted File, Comet Lustre, Comet SSD, and a workstation local disk. The slowdown axis extends to ~60x, with the Lustre curves highest.]
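
A minimal sketch (mine, not the authors' microbenchmark; the path and sizes are placeholders) of the two access patterns being compared:

  import java.io.RandomAccessFile

  // (a) n open-read-close cycles: one metadata-server round trip per read
  def manyOpenReadClose(path: String, n: Int, readSize: Int): Unit = {
    val buf = new Array[Byte](readSize)
    for (i <- 0 until n) {
      val f = new RandomAccessFile(path, "r")
      f.seek(i.toLong * readSize); f.read(buf)
      f.close()
    }
  }

  // (b) open once, read n times, close once: the open cost is amortized
  def openOnceReadMany(path: String, n: Int, readSize: Int): Unit = {
    val buf = new Array[Byte](readSize)
    val f = new RandomAccessFile(path, "r")
    for (i <- 0 until n) { f.seek(i.toLong * readSize); f.read(buf) }
    f.close()
  }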
Improving I/O Performance
• Eliminate file operations that affect the metadata server
  - Combine files within a node (currently a per-core combine)
  - Keep files open (cache fopen()), as sketched below
  - Use the memory-mapped local file system /dev/shm (no spill)
  - Use a file system backed by a single Lustre file
• These are partial solutions that need to be used in conjunction
  - Memory pressure is high in Spark due to resilience and poor garbage collection
  - fopen() is not necessarily issued by Spark itself (e.g., the Parquet reader)
  - Third-party layers are not optimized for HPC/Lustre
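
A minimal sketch (mine, with illustrative names, not the authors' implementation) of the file-pooling idea: open each file once, cache the handle, and serve later reads from the cached channel so repeated accesses avoid further metadata-server opens. A production pool would also need bounds and eviction.

  import java.io.RandomAccessFile
  import java.nio.ByteBuffer
  import java.nio.channels.FileChannel
  import scala.collection.mutable

  object FilePool {
    private val channels = mutable.Map.empty[String, FileChannel]

    // Open on first use, reuse afterwards: one open per file instead of one per access.
    def channel(path: String): FileChannel = synchronized {
      channels.getOrElseUpdate(path, new RandomAccessFile(path, "r").getChannel)
    }

    // Positional read; does not disturb other users of the cached channel.
    def readAt(path: String, pos: Long, len: Int): ByteBuffer = {
      val buf = ByteBuffer.allocate(len)
      channel(path).read(buf, pos)
      buf.flip()
      buf
    }

    def closeAll(): Unit = synchronized {
      channels.values.foreach(_.close())
      channels.clear()
    }
  }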
Adding File-Backed Filesystems in Shifter
• NERSC Shifter
  - Lightweight container infrastructure for HPC
  - Compatible with Docker images
  - Integrated with the Slurm scheduler
  - Idea: control mounting of filesystems within the container
• Per-Node Cache
  - New feature added to improve Spark performance
      --volume=$SCRATCH/backingFile:mnt:perNodeCache=size=100G
  - The file for each node is stored on the backend Lustre filesystem
  - A file-backed filesystem is mounted within each node's container instance at a common path (/mnt)
  - Single file open (intermediate data file opens are kept local)
• Now deployed in production on Cori (see the configuration sketch below)
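
A minimal sketch (mine; spark.local.dir is a standard Spark property, and the /mnt path matches the --volume example above) of pointing Spark's scratch space at the per-node mounted file so shuffle and spill opens never reach the Lustre metadata server:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("GroupByOnMountedFile")
    .set("spark.local.dir", "/mnt")   // shuffle + spill files go to the file-backed mount

  val sc = new SparkContext(conf)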
Scalability with the Shifter Lustre-mount Solution
[Figure: Cori GroupBy weak scaling, time to job completion (s) vs. nodes (1-320), comparing Ramdisk, Mounted File, and Lustre]
• Spark on plain Lustre scales only up to O(100) cores
• Spark on Lustre with mounted files scales up to O(10,000) cores
  - Only 60% slower than in-memory (Ramdisk)
  - Does not take away memory resources and is not limited in size to available memory
User-level Techniques Reduce System Call Overhead
[Figure: Cori GroupBy weak scaling, time to job completion vs. nodes (1-320), comparing Ramdisk, Ramdisk+Shifter, and Ramdisk+Pooling. At 10,000 cores with Ramdisk: file pooling adds an 8% speedup, Shifter a 15% speedup.]
• File pooling reduces time spent in syscalls by avoiding fopen calls
• Shifter moves some calls into user mode
• Shifter also benefits shared libraries, class files, and so on, which are stored on the mounted read-only filesystem
Midlayer storage: Optimizing for the Tail
[Figure: cumulative distribution of open times on the Burst Buffer vs. Lustre]
• BB median open is 2x slower than Lustre
• BB open variance is 5x smaller
• BB scales better than standalone Lustre
Global vs. Local Storage
[Figure: Comet spark-perf MLLib, time with Lustre storage divided by time with local SSD storage, per benchmark (naive-bayes, kmeans, svd, pca, block-matrix-mult, pearson, chi-sq-feature, chi-sq-gof, chi-sq-mat, word2vec, fp-growth, lda, pic, spearman, summary-stats, prefix-span) at 1, 2, 4, 8, and 16 nodes. I/O-bound benchmarks show large slowdowns on Lustre; compute-bound benchmarks stay close to 1x.]
• When the shuffle is not quadratic or iterative
  - Tasks are compute-bound
  - Lustre and NAS storage is competitive
Optimizations Make Global Storage Competitive
[Figure: Cori spark-perf MLLib, time with Lustre storage divided by time with the Lustre-backed mounted file, per benchmark at 1, 2, 4, 8, and 16 nodes]
• Shifter eliminates remote metadata I/O, but not read/write I/O
• High-shuffle benchmark slowdown is high on Lustre compared to the Lustre-backed mounted file
  - This indicates that most of the overhead (>10x) is in fopen
  - Not in latency/BW to storage (<2x)
• About 2x more slowdown in SSD vs. Lustre than in mounted file vs. Lustre
Global Storage matches Local Storage
[Figure: chi-sq-feature at 512 cores; time (s) broken down into App, Fetch, and JVM components on Comet, Comet RDMA, and Cori at 1, 2, 4, 8, and 16 nodes]
• Cori (TCP) is 20% faster than Comet (RDMA)
Competitive Advantage from Network Performance
[Figure: pic (Power Iteration Clustering); time (s) broken down into App, Fetch, and JVM components on Comet, Comet RDMA, and Cori at 1, 2, 4, 8, and 16 nodes]
• You do need good TCP
• The benefit of RDMA optimizations is target dependent
  - Single-node Comet is 50% faster than Cori
  - Cori/TCP is 27% faster than Comet/RDMA on 16 nodes
• Better communication leads to higher availability of cores
Conclusions and Impact
• Transformative recipe for tuning Spark on Lustre
  - We started at O(100)-core scalability
  - Showed O(10,000)-core scalability
• NERSC used our solutions up to 52,000 cores (Cori Phase I whole machine)
  - Lustre-mount released in Shifter (Cori and Edison)
    https://github.com/NERSC/shifter
• Future work
  - Global namespace enables "Spark" redesign
  - Combine Lustre-mount and Burst Buffer
  - Decentralize the scheduler
  - Evaluate the competitive advantage from better networks