Scaling Spark on Lustre
Nicholas Chaimov (University of Oregon), Costin Iancu, Khaled Ibrahim, Shane Canon (LBNL), Allen D. Malony (University of Oregon)
Workshop on Performance and Scalability of Storage Systems (WOPSSS 2016)

My Background
- Main area of research: HPC parallel performance measurement and analysis
  - TAU Performance System
- Research interests
  - Performance observation (application + system)
  - Introspection: runtime performance observation and query
  - In situ analysis
  - Feedback and adaptive control
- Performance optimization across the whole system
  - Based on behavior knowledge and performance models
- This work is supported by an IPCC grant to LBL for data analytics (Spark) on Lustre
  - Nick's summer internship and Ph.D. graduate support

Data Analytics
- Apache Spark is a popular data analytics framework
  - High-level constructs for expressing analytics computations
  - Fast and general engine for large-scale data processing
  - Supports datasets larger than the system's physical memory
- Improves programmer productivity through
  - High-level language front-ends (Scala, R, SQL)
  - Multiple domain-specific libraries: Streaming, SparkSQL, SparkR, GraphX, Splash, MLLib, Velox
- Specialized runtime provides elastic parallelism and resilience
- Developed for cloud and commodity environments
  - Latency-optimized local disk storage and bandwidth-optimized network
- Can Spark run well on HPC platforms?

Preview of Main Contributions
- Improved the scaling of Spark on Lustre by 520x: from 100 cores to 52,000 cores on Cray systems
- Deliver scalable data-intensive processing competitive with node-level local storage (SSD)

Berkeley Data Analytics Stack (Spark)
[Figure: BDAS software stack with the workloads used in this study: PageRank, Big Data Benchmark, collaborative filtering, spark-perf. From https://amplab.cs.berkeley.edu/software/]

HPC Node Performance on Spark
- A Cray node is ~2x slower than a workstation for Spark when data is on disk
  - Same concurrency: 86% slower on Edison
  - Using all 24 cores: 40% slower on Edison
- The problem is I/O!
- The Cray node matches workstation performance when data is cached
- Workload: Spark SQL Big Data Benchmark
  - S3 suffix /5nodes/, scale factor 5
  - Rankings: 90 million rows, 6.38 GB
  - UserVisits: 775 million rows, 126.8 GB
  - Documents: 136.9 GB
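To make the workload concrete, the following is a minimal sketch of the kind of scan query the Big Data Benchmark runs over the Rankings table, written against the Spark 1.5-era DataFrame API. The path, threshold, and exact query text are illustrative, not the benchmark's actual code.

  // Read the pre-partitioned Parquet input and run a simple scan query.
  // Every task that reads a partition performs its own file opens, which
  // is where the disk-bound slowdown above comes from.
  val rankings = sqlContext.read.parquet("/scratch/bdb/rankings.parquet")
  rankings.registerTempTable("rankings")
  val hot = sqlContext.sql(
    "SELECT pageURL, pageRank FROM rankings WHERE pageRank > 1000")
  hot.count()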
Spark and HPC Design Assumptions
Why is I/O a problem on the HPC node?
- Spark expects
  - Local disk with an HDFS overlay for the distributed file system
  - To use fast local disk (SSD) for shuffle files
  - That ALL disk operations are fast
- Spark generally targets cloud / commodity clusters
  - Disk I/O optimized for latency
  - Network optimized for bandwidth
  - Matches Spark's expectations well
- HPC systems conflict with these assumptions
  - Disk I/O optimized for bandwidth
  - Network optimized for latency

Getting a Handle on Design Assumptions
- Can HPC architectures give performance advantages?
  - Do we need local disks?
    - Cloud: node-local SSD
    - Burst Buffer: middle layer of SSD storage
    - Lustre: backend storage system
  - Can we exploit the advantages of HPC networks?
    - What is the impact of RDMA optimizations?
- Differences in architecture guide the software design
  - Evaluation of Spark on HPC systems (Cray XC30, XC40)
  - Techniques to improve performance on HPC architectures by eliminating disk I/O overhead

Data Movement in Spark
- The block is the unit of movement and execution
  - Vertical movement: blocks moved between memory and persistent storage
  - Horizontal movement: blocks transferred across the network to another node
  - The shuffle involves both vertical and horizontal movement, through temporary storage
[Figure: two nodes, each with cores running tasks, a Block Manager backed by memory and persistent storage, and a Shuffle Manager backed by temporary storage, connected by the interconnection network]

I/O Happens Everywhere
- Program input/output
  - Explicit
  - Distributed, with a global namespace (HDFS)
- Shuffle and block manager
  - Implicit
  - Local (Java): FileInputStream / FileOutputStream
- Lots of I/O, both disk and network
[Figure: job DAG with three stages; inputs are read and outputs written at stage boundaries (groupBy, union, join), all passing through the BlockManager]
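A minimal sketch (with hypothetical paths) of where the two kinds of I/O show up in a job: reads and writes of program data are explicit and go through the global namespace, while the shuffle triggered by a wide dependency writes intermediate files locally through the shuffle manager.

  // Explicit I/O: program input read through the global namespace
  val lines = sc.textFile("hdfs:///data/events")
  // Implicit I/O: groupByKey forces a shuffle, and the shuffle manager
  // writes and reads intermediate blocks on node-local temporary storage
  val byUser = lines.map(l => (l.split(",")(0), l)).groupByKey()
  // Explicit I/O: program output written back to the global namespace
  byUser.saveAsTextFile("hdfs:///data/events-by-user")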
Spark Data Management Abstraction
- Resilient Distributed Dataset (RDD)
  - Composed of partitions of data, which are composed of blocks
  - RDDs are created from other RDDs by applying transformations or actions
  - Each RDD has a lineage specifying how its blocks are computed
  - Requesting a block either retrieves it from cache or triggers computation

Word Count Example
  val textFile = sc.textFile("input.txt")
  val counts = textFile.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
  counts.collect()
- Transformations declare intent
  - They do not trigger computation; they simply build the lineage
  - textFile, flatMap, map, reduceByKey
- Actions trigger computation on the parent RDD
  - collect
- Data placement is transparently managed by the runtime
[Figure: lineage graph textFile -> flatMap -> map -> reduceByKey]

Partitioning
[Figure: the word-count lineage replicated across Nodes 1-6; each node holds a partition of each RDD]
- Data is partitioned by the runtime

Stages
[Figure: Job 0 split into Stage 0 (textFile, flatMap, map, local reduceByKey) and Stage 1 (global reduceByKey), each operating on partitions p1-p3]
- Vertical (local) data movement between transformations
- Horizontal (remote) and vertical data movement between stages (the shuffle)

Structure of Spark Input
- Data in Spark formats is pre-partitioned
- Parquet format: a single directory containing
    _common_metadata         ._common_metadata.crc
    _metadata                ._metadata.crc
    part-r-00001.gz.parquet  .part-r-00001.gz.parquet.crc
    part-r-00002.gz.parquet  .part-r-00002.gz.parquet.crc
    [...]
    part-r-03977.gz.parquet  .part-r-03977.gz.parquet.crc
    _SUCCESS                 ._SUCCESS.crc
- This example has 3,977 partitions
  - 3,977 data files
  - 3,977 checksum files
  - 3 metadata files and 3 metadata checksum files

Reason for Multiple File Opens
- Even with greatly increased time spent blocking on open, individual opens are still short
- However, there are a large number of opens
  - At minimum, each task opens a file, reads its part, and closes the file
    - At least one file open per input partition
  - Many readers do multiple opens per partition
    - sc.textFile: 2 opens per partition
  - Parquet reader: each task
    - Opens the input file, opens the checksum file, compares checksums
    - Closes the input file and the checksum file
    - Opens the input file, reads the footer, closes the input file
    - Opens the input file, reads the actual data, closes the input file
  - Total of 4 file open/close cycles per task
  - 3,977 x 4 = 15,908 file opens to read the Big Data Benchmark dataset

Shuffle (Communication)
- Output (map) phase: every node independently
  - sorts its local data
  - writes the sorted data to disk
- Input (reduce) phase: every node
  - reads local blocks from disk
  - issues requests for remote blocks
  - services incoming requests for blocks
[Figure: the Stage 0 / Stage 1 diagram again, with map output written per partition and read back during the reduce]

Shuffle Directory Structure
[Figure: the two-node Block Manager / Shuffle Manager diagram again]
- Each node stores its map outputs in a worker-specific directory in shuffle temporary storage
  - Storage is in subdirectories: 15/shuffle_0_1_0.data, 36/shuffle_0_2_0.data, 29/shuffle_0_3_0.data, ...
  - There are as many shuffle files as there are shuffle tasks
    - The number of shuffle tasks is configurable, as is the number of block managers (see the sketch below)
    - Default: the maximum number of partitions of any parent RDD
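A minimal sketch of how the reduce-side partition count can be set, which in turn determines how many shuffle files each node produces. The value 192 is illustrative; the slides only say that the partition count should exceed the core count for load balancing.

  import org.apache.spark.{SparkConf, SparkContext}

  // Default parallelism is used when no explicit partition count is given.
  val conf = new SparkConf().setAppName("shuffle-sizing")
                            .set("spark.default.parallelism", "192")
  val sc = new SparkContext(conf)

  val pairs = sc.textFile("input.txt").map(line => (line.split(" ")(0), 1))
  // The second argument overrides the default
  // (the maximum partition count of any parent RDD).
  val counts = pairs.reduceByKey(_ + _, 192)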
Shuffle Writes
- Each shuffle file is written to as many times as there are partitions in the input
  - This is controllable
  - Partitions > cores are needed for load balancing and latency hiding
- For each write
  - Open the file in append mode
  - Write the results of sorting the partition
  - Close the file
- The size of a write is the size of the partition after the local reduce
  - Varies with the workload
  - Partitions are constrained to fit in memory
  - Often very small (in practice it will never exceed 1 GB)

OK, So What About Spark on HPC Systems?
- Spark certainly can run on HPC platforms
- Consider Spark on Cray machines
  - Cray XC30 (Edison) and XC40 (Cori)
  - Lustre parallel distributed file system
- Out of the box, there are performance issues
- The good news is that it's all about Lustre!
- Perform experiments comparing Lustre and the Burst Buffer using Spark benchmarks

Lustre Design
[Figure: Lustre architecture, with a single-node Spark deployment and a multi-node deployment accessing the file system over the network]

Experimental Setup
- Cray XC30 at NERSC (Edison): 2.4 GHz Ivy Bridge, 64 GB RAM
- Cray XC40 at NERSC (Cori): 2.3 GHz Haswell, 128 GB RAM, Cray DataWarp Burst Buffer
- Comet at SDSC: 2.5 GHz Haswell, InfiniBand FDR, 320 GB SSD, 128 GB RAM regular nodes, 1.5 TB RAM large-memory nodes
- Spark 1.5.0 with the spark-perf benchmarks (Core + MLLib)

I/O Scalability (Lustre and Burst Buffer, Cori)
[Figure: GroupByTest I/O components on Cori, 1-16 nodes; time per open, read, and write operation on Lustre, private Burst Buffer, and striped Burst Buffer]
- Spark results in lots of opens: 9,216 on 1 node, growing to 36,864; 147,456; 589,824; and 2,359,296 on 16 nodes
- Open/close operations grow as O(cores^2)
- Opens bottleneck on the single MDS node; reads and writes scale better because they are spread over multiple OSS nodes

I/O Variability Is HIGH, with Extreme Outliers
[Figure: distribution of read and open times]
- The number of fopens quickly overwhelms the MDS
- Variability in fopen access time is the real problem
  - Mean fopen time is 23x larger than on SSD; its variability is 14,000x larger
- Extreme outliers result in straggler tasks (the longest open bounds stage completion!)
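A back-of-the-envelope check of the O(cores^2) growth, as a small Scala sketch. It assumes (an inference from the chart, not stated on the slide) that each map partition appends once per reduce partition and that there are about 96 map and reduce partitions per node; under that assumption the products reproduce the open counts above.

  // Reproduce the open counts from the Cori GroupByTest chart above,
  // assuming ~96 map/reduce partitions per node (inferred, not stated).
  val partitionsPerNode = 96
  for (nodes <- Seq(1, 2, 4, 8, 16)) {
    val p = partitionsPerNode * nodes
    println(s"$nodes nodes: $p x $p = ${p * p} open/close cycles")
  }
  // 1 node: 9,216 ... 16 nodes: 2,359,296 -- matching the chart.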
Many Opens versus One Open
[Figure: slowdown of n x (open-read-close) cycles versus open-once-read-many, as a function of read size (1 KB to 512 KB), on Edison Lustre, a workstation local disk, Cori Lustre, Cori striped and private Burst Buffer, a Cori mounted file, Comet Lustre, and Comet SSD; on Lustre the slowdown reaches roughly 60x for small reads]

Improving I/O Performance
- Eliminate file operations that hit the metadata server
  - Combine files within a node (currently a per-core combine)
  - Keep files open (cache fopen())
  - Use a memory-mapped local file system, /dev/shm (no spill)
  - Use a file system backed by a single Lustre file
- These are partial solutions that need to be used in conjunction
  - Memory pressure is high in Spark due to resilience and poor garbage collection
  - fopen() is not necessarily called from Spark itself (e.g., the Parquet reader)
  - Third-party layers are not optimized for HPC/Lustre

Adding File-Backed Filesystems in Shifter
- NERSC Shifter
  - Lightweight container infrastructure for HPC
  - Compatible with Docker images
  - Integrated with the Slurm scheduler
  - Idea: control the mounting of filesystems within the container
- Per-Node Cache: a new feature added to improve Spark performance
  - --volume=$SCRATCH/backingFile:mnt:perNodeCache=size=100G
  - The backing file for each node is stored on the backend Lustre filesystem
  - A file-backed filesystem is mounted within each node's container instance at a common path (/mnt)
  - A single file open on Lustre; intermediate data file opens stay local
- Now deployed in production on Cori

Scalability with the Shifter Lustre-Mount Solution
[Figure: Cori GroupBy weak scaling, time to job completion on 1-320 nodes for ramdisk, mounted file, and plain Lustre]
- Spark on plain Lustre scales only up to O(100) cores
- Spark on Lustre with mounted files scales up to O(10,000) cores
  - Only 60% slower than in-memory (ramdisk)
  - Unlike a ramdisk, it does not take away memory resources and is not limited in size to the available memory

User-Level Techniques Reduce System Call Overhead
[Figure: Cori GroupBy weak scaling, time to job completion on 1-320 nodes for ramdisk, ramdisk + Shifter, and ramdisk + file pooling]
- At 10,000 cores with ramdisk: file pooling gives an 8% speedup, Shifter a 15% speedup
- File pooling reduces time spent in syscalls by avoiding fopen calls (see the sketch below)
- Shifter moves some calls into user mode
- Shifter also benefits shared libraries, class files, and so on, which are stored on a mounted read-only filesystem
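As an illustration of the file-pooling idea (not the actual patch applied to Spark), a minimal handle cache that keeps shuffle files open across appends, so repeated writes to the same path do not go back to the Lustre MDS:

  import java.io.RandomAccessFile
  import scala.collection.concurrent.TrieMap

  // Illustrative sketch only: cache open handles per path and reuse them.
  object FilePool {
    private val handles = TrieMap.empty[String, RandomAccessFile]

    // Open on first use; later callers reuse the cached handle.
    def get(path: String): RandomAccessFile =
      handles.getOrElseUpdate(path, new RandomAccessFile(path, "rw"))

    // Release everything at the end of the task or stage.
    def closeAll(): Unit = {
      handles.values.foreach(_.close())
      handles.clear()
    }
  }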
Midlayer Storage: Optimizing for the Tail
[Figure: cumulative distribution of open times on the Burst Buffer versus Lustre]
- The Burst Buffer's median open is 2x slower than Lustre's
- But its open-time variance is 5x smaller
- The Burst Buffer scales better than standalone Lustre

Global vs. Local Storage
[Figure: Comet spark-perf MLLib, time with Lustre storage divided by time with SSD storage, across node counts, for naive-bayes, kmeans, svd, pca, block-matrix-mult, pearson, chi-sq-feature, chi-sq-gof, chi-sq-mat, word2vec, fp-growth, lda, pic, spearman, summary-stats, and prefix-span; I/O-bound benchmarks show large slowdowns, compute-bound benchmarks do not]
- When the shuffle is not quadratic or iterative, tasks are compute-bound and Lustre/NAS storage is competitive

Optimizations Make Global Storage Competitive
[Figure: Cori spark-perf MLLib, time with plain Lustre divided by time with the Lustre-backed mounted file, across node counts, for the same benchmarks]
- Shifter eliminates remote metadata I/O, but not read/write I/O
- The slowdown of high-shuffle benchmarks is high on plain Lustre compared to the Lustre-backed mounted file
  - This indicates that most of the overhead (>10x) is in fopen, not in latency/bandwidth to storage (<2x)
  - There is about 2x more slowdown in Lustre vs. SSD than in Lustre vs. the mounted file

Global Storage Matches Local Storage
[Figure: chi-sq-feature at 512 cores; time broken down into application, fetch, and JVM components on 1-16 nodes of Comet, Comet with RDMA, and Cori]
- Cori (TCP) is 20% faster than Comet (RDMA)

Competitive Advantage from Network Performance
[Figure: pic (PowerIterationClustering); time broken down into application, fetch, and JVM components on 1-16 nodes of Comet, Comet with RDMA, and Cori]
- You do need good TCP
- The benefit of RDMA optimizations is target dependent
  - On a single node, Comet is 50% faster than Cori
  - Cori/TCP is 27% faster than Comet/RDMA on 16 nodes
- Better communication leads to higher availability of cores

Conclusions and Impact
- A transformative recipe for tuning Spark on Lustre
  - We started at O(100)-core scalability
  - We showed O(10,000)-core scalability
- NERSC used our solutions up to 52,000 cores (the whole of Cori Phase I)
  - The Lustre-mount feature is released in Shifter (Cori and Edison): https://github.com/NERSC/shifter
- Future work
  - The global namespace enables a "Spark" redesign
  - Combine Lustre-mount and the Burst Buffer
  - Decentralize the scheduler
  - Evaluate the competitive advantage from better networks

Workshop on Performance and Scalability of Storage Systems (WOPSSS 2016)