
Comparison of Big Data versus
High-Performance Computing:
Some Observations
Helmut Neukirchen
University of Iceland
[email protected]
Research performed as visiting scientist at
Jülich Supercomputing Centre (JSC).
Computing time granted by JSC. Thanks to Morris Riedel.
Use High-Performance Computing
or Big Data processing?
• Standard approach for computationally intensive problems:
High-Performance Computing (HPC).
– Low-level C/C++/Fortran parallel processing implementations,
– Low-level send/receive communication (Message Passing Interface, MPI),
• Fast interconnects (e.g. InfiniBand).
– Fast central storage (RAID array attached via interconnect).
• Is parallel processing offered by the Big Data framework
Apache Spark a competitive alternative to HPC?
– High-level Java/Scala/Python implementations,
– Convenient high-level programming model (serial view, implicit communication),
– Distributed HDFS file system, slow Ethernet.
• Move processing to where the data is locally stored (mitigates slow communication).
Spark against HPC –
a practical experiment: Hardware
• Cluster JUDGE at Jülich Supercomputing Centre.
– 39 executor nodes:
• each 42 GB usable RAM,
• shared by 24 hyperthreads.
– (2 Intel Xeon X5650 CPUs: 6 cores = 12 hyperthreads each),
– Totalling 39 × 24 = 936 hyperthread cores.
• Connected via InfiniBand (for HPC) and Gigabit Ethernet (for Spark).
• For Spark: local hard disks, 222.9 MB/s peak each, totalling
8.7 GB/s in parallel. HDFS replication factor 2, 128 MB blocks.
• For HPC: Storage cluster JUST connected via InfiniBand: 160 GB/s peak.
Spark against HPC –
a practical experiment: Application
• Density-based spatial clustering of
applications with noise (DBSCAN).
– Detects arbitrarily shaped clusters,
– Detects and can filter noise,
– No need to know number of clusters.
• Two parameters:
– Spatial search radius ε,
– Point density minPts.
• At least minPts elements needed within ε radius to form a cluster.
• Otherwise considered noise.
Ester, Kriegel, Sander, Xu, “A density-based algorithm for discovering
clusters in large spatial databases with noise,” Proc. Second Int. Conf. on
Knowledge Discovery and Data Mining (KDD-96). AAAI Press, 1996.
DBSCAN: Properties
• Simple distance calculations (=more like big data),
but still floating point (=more like HPC).
– Compare each of the n points with each of the remaining n-1
points to see whether their distance is ≤ ε ⇒ O(n²)
(see the sketch below).
– Spatially sorted data structures (R-trees, R*-trees, kd-trees):
compare each of the n points with spatially close points only
⇒ O(n log n).
• No permanent intermediate results exchange
(=more like big data), but still strong relationship
between data (=more like HPC).
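As an illustration of the naive O(n²) approach, below is a minimal Scala sketch (assumed names and types, not taken from any of the benchmarked implementations) of the all-pairs ε-neighbourhood query and the resulting core-point test:

// Illustrative sketch only: the naive epsilon-neighbourhood query behind the
// O(n^2) bound -- every point is compared against every other point -- and
// the core-point test (at least minPts points within radius eps).
case class Point(x: Double, y: Double)

def distance(a: Point, b: Point): Double =
  math.sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y))

// For every point, the indices of all points within radius eps (O(n^2) comparisons).
def neighbourhoods(points: Vector[Point], eps: Double): Vector[Vector[Int]] =
  points.map(p => points.indices.filter(i => distance(p, points(i)) <= eps).toVector)

// Indices of core points: at least minPts neighbours (including the point itself).
def corePoints(points: Vector[Point], eps: Double, minPts: Int): Vector[Int] =
  neighbourhoods(points, eps).zipWithIndex
    .collect { case (nbrs, i) if nbrs.size >= minPts => i }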
DBSCAN: Parallelisation
• Originally formulated as a non-parallel (“serial”) algorithm.
• Clustering itself can be done independently in parallel:
1. Re-shuffle all data to have it spatially sorted (R-trees, R*-trees, kd-trees)
to ease decomposition of the input domain into boxes,
2. Independent clustering of the boxes in parallel
(ε overlap at box boundaries needed to deal with points in the border area
of the neighbouring boxes: “ghost” or “halo” regions; see the sketch below),
3. Final result exchange
(between neighbouring boxes)
to merge clusters spanning multiple boxes.
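The ghost/halo idea of step 2 can be sketched in Scala as follows; the regular grid of boxes and the function names are simplifying assumptions for illustration, not the decomposition of any particular implementation. Each point is assigned to its own box and duplicated into every neighbouring box whose border lies within ε, so each box can then be clustered independently:

case class Point(x: Double, y: Double)

// Index of the grid box (of side length boxSize) containing a point.
def boxId(p: Point, boxSize: Double): (Int, Int) =
  (math.floor(p.x / boxSize).toInt, math.floor(p.y / boxSize).toInt)

// Assign a point to its own box plus every neighbouring box that it reaches
// within eps ("ghost"/"halo" duplication at box boundaries).
def assignWithHalo(p: Point, boxSize: Double, eps: Double): Seq[((Int, Int), Point)] = {
  val (bx, by) = boxId(p, boxSize)
  for {
    dx <- -1 to 1
    dy <- -1 to 1
    // own box, or a neighbour box reached when the point is shifted by eps
    if (dx == 0 && dy == 0) ||
       boxId(Point(p.x + dx * eps, p.y + dy * eps), boxSize) == (bx + dx, by + dy)
  } yield ((bx + dx, by + dy), p)
}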
Spark against HPC – a practical experiment:
Benchmarked DBSCAN Implementations
• HPC (MPI and OpenMP): C++
– HPDBSCAN, arbitrary dimensions, O(n log n),
https://bitbucket.org/markus.goetz/hpdbscan.
• Spark: Scala/JVM, all 2D only:
– Spark DBSCAN, O(n²), https://github.com/alitouka/spark_dbscan,
– Spark_DBSCAN, O(n²), https://github.com/aizook/SparkAI,
– DBSCAN On Spark, https://github.com/mraad/dbscan-spark.
• Does in fact implement only an approximation of DBSCAN (square domain-decomposition
cells are used as density neighbourhoods instead of the ε search radius and halos), yielding
completely different (= wrong) clusters.
– RDD-DBSCAN, O(n log n),
https://github.com/irvingc/dbscan-on-spark.
• Serial Java/JVM implementation for comparison:
– ELKI 0.7.1, O(n log n) using R* tree,
https://elki-project.github.io.
HPDBSCAN: Domain Decomposition
• Highly Parallel DBSCAN:
Götz, Bodenstein, Riedel, “HPDBSCAN: highly parallel
DBSCAN,” Proc. Workshop on Machine Learning in
High-Performance Computing Environments, in
conjunction with Supercomputing 2015, ACM.
– Decompose input data into ε size cells.
– Load balancing between processors based on comparison
costs (= # of distance calculations).
• Costs per cell := #points * #neighbours
• Supports assigning a different number of cells
to each processor to achieve approximately
the same number of comparisons per processor
(see the sketch below).
– ε halos at processor boundaries.
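As a rough illustration of this cost model (not HPDBSCAN's actual code), the following Scala sketch computes the cost of each cell from the points in the cell and its neighbouring cells, then greedily assigns cells to processors so that all processors end up with roughly the same number of comparisons; the greedy strategy is an assumption for illustration:

// Cost per cell, following the model above: number of points in the cell
// times the number of points in the cell plus its 8 neighbouring cells.
def cellCosts(pointsPerCell: Map[(Int, Int), Int]): Map[(Int, Int), Long] =
  pointsPerCell.map { case (cell @ (cx, cy), n) =>
    val neighbourPoints = (for { dx <- -1 to 1; dy <- -1 to 1 }
      yield pointsPerCell.getOrElse((cx + dx, cy + dy), 0)).sum
    cell -> n.toLong * neighbourPoints
  }

// Greedy balancing: give each cell (largest cost first) to the processor
// that currently carries the smallest total cost.
def balance(costs: Map[(Int, Int), Long], processors: Int): Map[Int, Vector[(Int, Int)]] = {
  val load = scala.collection.mutable.ArrayBuffer.fill(processors)(0L)
  val cells = scala.collection.mutable.Map.empty[Int, Vector[(Int, Int)]]
    .withDefaultValue(Vector.empty)
  for ((cell, cost) <- costs.toSeq.sortBy(-_._2)) {
    val p = load.indices.minBy(load)
    load(p) += cost
    cells(p) = cells(p) :+ cell
  }
  cells.toMap
}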
Cordova, Moh, “DBSCAN on Resilient
Distributed Datasets,” in 2015 Int.
Conf. on High Performance Computing
& Simulation (HPCS), IEEE.
RDD-DBSCAN:
Domain Decomposition
• Load balancing based on number of points:
1. Initial data space.
2. Recursive horizontal or vertical split of the data space into boxes
containing the same number of points; boxes do not get smaller than ε
(see the sketch below).
3. Grow the boxes by ε on each of the 4 sides to achieve overlap.
(Illustrations: http://www.irvingc.com/visualizing-dbscan)
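A compact Scala sketch of the recursive split of step 2, under simplifying assumptions (split along the longer side at the median coordinate until a box holds at most maxPoints points; stop before a side could drop below ε; the ε-growth of step 3 is omitted):

case class Box(xMin: Double, yMin: Double, xMax: Double, yMax: Double)

// Recursively split a box along its longer side at the median coordinate
// until it holds at most maxPoints points or further splitting would create
// boxes with a side shorter than eps.
def split(points: Vector[(Double, Double)], box: Box,
          maxPoints: Int, eps: Double): Vector[Box] = {
  val width  = box.xMax - box.xMin
  val height = box.yMax - box.yMin
  if (points.size <= maxPoints || (width < 2 * eps && height < 2 * eps)) Vector(box)
  else if (width >= height) {
    val cut = points.map(_._1).sorted.apply(points.size / 2)   // median x
    val (left, right) = points.partition(_._1 < cut)
    if (left.isEmpty || right.isEmpty) Vector(box)              // degenerate split
    else split(left,  box.copy(xMax = cut), maxPoints, eps) ++
         split(right, box.copy(xMin = cut), maxPoints, eps)
  } else {
    val cut = points.map(_._2).sorted.apply(points.size / 2)   // median y
    val (low, high) = points.partition(_._2 < cut)
    if (low.isEmpty || high.isEmpty) Vector(box)
    else split(low,  box.copy(yMax = cut), maxPoints, eps) ++
         split(high, box.copy(yMin = cut), maxPoints, eps)
  }
}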
Spark against HPC – a practical
experiment: Dataset used in Benchmark
• Geo-tagged tweets covering the United Kingdom.
– Heavily skewed, e.g. most tweets located in London.
– Trivia: the noise includes Twitter spam, i.e. geo-tagged spam tweets with nonsense locations.
• DBSCAN parameters: ε =0.01, minPts=40.
– Will return as clusters locations where people tweet a lot (e.g. tourist
spots, cities, but also roads/train tracks, ferries across the channel).
• Size & file format:
– 3 704 351 data points (longitude, latitude as floating point values).
– Not really big data:
• 57 MB in HDF5 binary HPC format,
• Spark does not support binary formats well, so the data had to be converted to CSV:
67 MB in CSV textual format for Spark (= fits into 1 HDFS block); see the loading sketch below.
http://hdl.handle.net/11304/6eacaa76-c275-11e4-ac7e-860aa0063d1f
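For illustration, a minimal Spark (Scala) job that loads such a CSV file into an RDD of (longitude, latitude) pairs; the HDFS path and column order are assumptions, not the actual benchmark setup:

import org.apache.spark.{SparkConf, SparkContext}

object LoadTweetPoints {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-tweet-points"))
    // Hypothetical HDFS path; one "longitude,latitude" pair per line.
    val points = sc.textFile("hdfs:///data/twitter_uk.csv")
      .map(_.split(','))
      .map(cols => (cols(0).toDouble, cols(1).toDouble))
    println(s"loaded ${points.count()} points")
    sc.stop()
  }
}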
Measurements: numbers
Neukirchen: “Survey and Performance Evaluation of DBSCAN Spatial Clustering
Implementations for Big Data and High-Performance Computing Paradigms”,
Technical Report VHI-01-2016, University of Iceland, November 2016.
Measurements: charts
[Chart: scalability of HPDBSCAN, runtime in seconds vs. number of cores]
Interpretation of
Measurements & Implementation
• Benchmark of O(n²) implementations aborted (far too slow),
• Data is heavily skewed (high-density in London):
– Domain decomposition of RDD-DBSCAN cannot compete with
HPDBSCAN: RDD partitions not equally filled:
• While almost all executors have finished their work, a few
long-running tasks remain (those processing the boxes that contain a
lot of data points):
935 cores idle, but 1 core busy for 5 further minutes... In fact, that
high-density box takes so long that 57 cores would suffice for the rest.
• C++ is ≈9 times faster than Java/JVM.
– Spark Scala/JVM RDD-DBSCAN on 1 core
≈7 times slower than optimised serial Java/JVM ELKI.
Conclusions
• What matters for HPC still applies to Apache Spark:
– Implementation complexity matters,
– Domain decomposition/load balancing matters.
• HPC faster than Apache Spark:
– Java/Scala significantly slower than C/C++,
• Unfortunately, no C/C++ big data frameworks available.
– HPC I/O is typically faster,
• Even though non-local: fast RAID and fast interconnects.
– Automated Spark parallel processing not as good as handcrafted HPC.
– Binary data formats not well supported by big data frameworks.
• But:
– HPC hardware far more expensive,
– Spark runs are fault tolerant!
– More effort needed to implement low-level HPC code than high-level Spark code!
• Thank you for your attention!
• Any questions?
Supercomputing /
High-Performance Computing (HPC)
• Computationally intensive problems. Mainly:
– Floating Point Operations (FLOP).
• HPC algorithms implemented rather low-level
(=close to hardware/fast):
– Programming languages: Fortran, C/C++.
– Explicit intermediate results exchange (MPI).
• Input & output data processed by a node
fit typically into its main memory (RAM).
http://www.vedur.is/vedur/frodleikur/greinar/nr/3226
https://www.quora.com/topic/Message-Passing-Interface-MPI
HPC hardware
• Compute nodes: fast CPUs.
• Nodes connected via fast interconnects (e.g. InfiniBand).
• Parallel File System storage: accessed by compute nodes via interconnect.
– Many hard disks in parallel (RAID): high aggregated bandwidth.
• Very expensive, but needed for highest performance of the HPC processing model:
– Read input once, compute & exchange intermediate results, write final result.
[Diagram: compute nodes connected to the central storage via the interconnect]
http://www.dmi.dk/nyheder/arkiv/nyheder-2016/marts/ny-supercomputer-i-island-en-billedfortaelling/
http://www.semantic-evolution.com
Big Data processing
• Typically, simple operations instead of number crunching.
– E.g. search engine crawling the web: index words & links on web pages.
• Algorithms do not require much exchange of intermediate results.
⇒ Input/Output (I/O) of the data is the most time-consuming part.
– Computation and communication less critical.
⇒ Big Data algorithms can be implemented rather high-level:
– Programming languages: Java, Scala, Python.
– Big Data platform: Apache Spark (in the past: Apache Hadoop/MapReduce), as sketched below:
• Automatically read new data chunks,
• Automatically execute algorithm implementation in parallel,
• Automatically exchange intermediate results as needed.
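To make the high-level programming model concrete, a minimal Spark (Scala) example in the spirit of the web-indexing use case above: only the per-record logic is written by hand, while reading the chunks, parallel execution, and the exchange of intermediate results are handled by the framework (the paths are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object WordIndex {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-index"))
    val counts = sc.textFile("hdfs:///crawl/pages")   // chunks are read automatically
      .flatMap(_.split("\\s+"))                       // per-record logic, run in parallel
      .map(word => (word, 1))
      .reduceByKey(_ + _)                             // intermediate results exchanged implicitly
    counts.saveAsTextFile("hdfs:///crawl/word-counts")
    sc.stop()
  }
}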
Big Data hardware
• Cheap standard PC nodes with
local storage, Ethernet network.
https://www.flickr.com/photos/cmnit/2040385443
– Distributed File System (HDFS): each
node stores locally a part of the whole data.
– Hadoop/Spark move processing of data
to where the data is locally stored.
Slow network connection not critical.
– Cheap hardware more likely to fail:
Hadoop and Spark are fault tolerant.
• Processing model: read chunk of local
data, process chunk locally, repeat;
finally: combine and write result.