Performance Optimization of Component-based Data Intensive Applications

Alan Sussman ([email protected])
Michael Beynon
Tahsin Kurc
Umit Catalyurek
Joel Saltz

University of Maryland
http://www.cs.umd.edu/projects/adr


Outline
- Motivation
- Approach
- Optimization – Group Instances
- Optimization – Transparent Copies
- Ongoing Work
Targeted Applications / Runtime Environment

Targeted applications: pathology (digitized microscopy slides), volume rendering, surface and groundwater modeling, satellite data analysis.

Heterogeneous Shared Resources:
- Host level: machine, CPUs, memory, disk storage
- Network connectivity

Many Remote Datasets:
- Inexpensive archival storage clusters (1TB ~ $10k)
- Islands of useful data
- Too large for replication
DataCutter

[Architecture diagram: clients issue range queries through the Interface Service; the Indexing Service maps each query to segments described as (File, Offset, Size); the Data Access Service retrieves segment data from the archival storage systems; the Filtering Service runs filters over the retrieved segments before results are returned to the client.]

Indexing Service
- Multilevel hierarchical indexes based on spatial indexing methods – e.g., R-trees
- Relies on an underlying multi-dimensional space
- User can add new indexing methods

Filtering Service
- Distributed C++ component framework
- Transparent tuning and adaptation for heterogeneity
- Filters implemented as threads – 1 process per host

Versions of both services are integrated into SDSC SRB.
Indexing - Subsetting

- Datasets are partitioned into segments
  – used to index the dataset; the unit of retrieval
  – spatial indexes built from the bounding boxes of all elements in a segment
- Indexing very large datasets (a lookup sketch follows below)
  – multi-level hierarchical indexing scheme
  – summary index files – for a collection of segments or of detailed index files
  – detailed index files – to index the individual segments
[Diagram: a raw dataset partitioned into segments; index entries map segment bounding boxes to (File, Offset, Size) segment info.]
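To make the two-level scheme concrete: a range query is first pruned against the summary index, and only the matching detailed indexes are consulted for individual segments. The sketch below is a minimal illustration of that lookup, not DataCutter's API; the Box, SummaryEntry, and find_segments names are invented here, and a real index would use an R-tree rather than the linear scans shown.

    #include <string>
    #include <vector>

    // Hypothetical types for illustration only.
    struct Box {
      double lo[2], hi[2];                       // bounding box in a 2-D attribute space
      bool overlaps(const Box& o) const {
        for (int d = 0; d < 2; ++d)
          if (o.hi[d] < lo[d] || hi[d] < o.lo[d]) return false;
        return true;
      }
    };

    struct Segment       { Box bbox; std::string file; long offset; long size; };
    struct DetailedIndex { Box bbox; std::vector<Segment> segments; };   // indexes individual segments
    struct SummaryEntry  { Box bbox; const DetailedIndex* detail; };     // summary level -> detailed file

    // Two-level lookup for a range query: prune with the summary index, then test
    // only the segments in the detailed indexes whose bounding boxes intersect.
    std::vector<Segment> find_segments(const std::vector<SummaryEntry>& summary, const Box& query) {
      std::vector<Segment> hits;
      for (const SummaryEntry& s : summary) {
        if (!s.bbox.overlaps(query)) continue;            // summary-level pruning
        for (const Segment& seg : s.detail->segments)     // detailed-level test
          if (seg.bbox.overlaps(query)) hits.push_back(seg);
      }
      return hits;                                        // each hit gives (File, Offset, Size) to retrieve
    }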
Filter-Stream Programming (FSP)

Purpose: specialized components for processing data
- based on Active Disks research [Acharya, Uysal, Saltz: ASPLOS'98], dataflow, functional parallelism, message passing
- filters – the logical unit of computation
  – high-level tasks
  – init, process, finalize interface
- streams – how filters communicate
  – unidirectional buffer pipes
  – use fixed-size buffers (min, good)
- filter connectivity and filter-level characteristics are specified manually


FSP: Abstractions

Unit-of-work cycle (a filter sketch follows below):
- "work" is application defined (e.g., a query)
- work is appended to running instances
- init(), process(), finalize() are called for each unit of work
- process() returns { EndOfWork | EndOfFilter }
- allows for adaptivity
[Diagram: two filters connected by a stream; fixed-size buffers (buf) flow between them for successive units of work uow 0, uow 1, uow 2.]
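To make the filter interface concrete, the following sketch shows the shape of a filter written against an init/process/finalize interface whose process() returns EndOfWork or EndOfFilter, as described above. The type names (Buffer, Stream, WorkInfo) and the stub stream methods are assumptions for this illustration, not the actual DataCutter headers.

    #include <cstddef>

    // Hypothetical framework types, named only for this illustration.
    struct Buffer { void* data = nullptr; std::size_t size = 0; };
    struct Stream {
      bool read(Buffer&)        { return false; }  // stub: false once upstream ends this unit of work
      void write(const Buffer&) {}                 // stub: forward a buffer downstream
    };
    struct WorkInfo { Stream* in = nullptr; Stream* out = nullptr; };  // streams bound for the current work
    enum  Result    { EndOfWork, EndOfFilter };

    // A filter exposes init/process/finalize; they are called for each unit of work
    // appended to a running instance.  process() returns EndOfWork to wait for more
    // work, or EndOfFilter to shut the filter down.
    class DecompressFilter {
    public:
      int init(WorkInfo&) { /* set up per-work state, e.g. decompression tables */ return 0; }

      Result process(WorkInfo& w) {
        Buffer b;
        while (w.in->read(b)) {          // consume fixed-size buffers from the input stream
          Buffer out = decompress(b);    // application-defined computation
          w.out->write(out);             // produce to the downstream filter
        }
        return EndOfWork;                // ready for the next appended unit of work
      }

      int finalize(WorkInfo&) { /* release per-work state */ return 0; }

    private:
      Buffer decompress(const Buffer& b) { return b; /* placeholder for the real work */ }
    };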
Placement

- The dynamic assignment of filters to particular hosts for execution is placement (or mapping)
- Optimization criteria:
  – Communication
    - minimize communication volume on slower connections
    - co-locate filters with large communication volume
    - leverage filter affinity to a dataset
  – Computation
    - put expensive computation on faster, less loaded hosts


Optimization - Group Instances

- Filter group
  – a logical collection of filters to use together
  – the application starts filter group instances (a sketch follows below)
[Example filter group from the diagram: Extract ref (from a reference DB), Extract raw (from the raw dataset), 3D reconstruction, View result.]
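A rough illustration of the group-instance idea: start several instances of a filter group with a manual filter-to-host placement, then append units of work to the running instances. The FilterGroup, start_instances, and append_work names are placeholders invented for this sketch, not the framework's real calls.

    #include <cstddef>
    #include <string>
    #include <vector>

    // Placeholder declarations; in the real framework, filters derive from the
    // filter base class shown earlier and the runtime creates the instances.
    struct FilterSpec    { std::string name; std::string host; };      // a filter and the host it is placed on
    struct GroupInstance { void append_work(const std::string&) { /* hand one unit of work to this instance */ } };

    struct FilterGroup {
      std::vector<FilterSpec> filters;           // logical collection of filters used together
      void connect(const std::string&, const std::string&) { /* declare a stream between two filters */ }
      std::vector<GroupInstance> start_instances(int n) { return std::vector<GroupInstance>(n); }
    };

    int main() {
      FilterGroup g;
      g.filters = { {"read", "host1"}, {"process", "host2"}, {"view", "client"} };  // manual placement
      g.connect("read", "process");
      g.connect("process", "view");

      // Start several instances of the group, then append units of work
      // ("work" is application defined, e.g. a query) to the running instances.
      std::vector<GroupInstance> instances = g.start_instances(4);
      std::vector<std::string> queries = {"q0", "q1", "q2", "q3", "q4", "q5"};
      for (std::size_t i = 0; i < queries.size(); ++i)
        instances[i % instances.size()].append_work(queries[i]);
      return 0;
    }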
Experiment - Application Emulator

- Parameterized dataflow filter
  – work: consume from all inputs, compute, produce to all outputs (see the sketch below)
- Application emulated: P → F → C filter groups (P0-F0-C0, P1-F1-C1, ...) placed across host1, host2, host3 (2 CPUs each)
  – process 64 units of work
  – single batch
- Match the number of instances to the environment (CPU capacity, network)
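A minimal sketch of the parameterized dataflow filter: per unit of work it consumes one buffer from every input stream, spins for a configurable amount of synthetic computation, and produces one buffer to every output stream, which is how the emulated application can be made I/O intensive or CPU intensive. The stream stubs and the compute knob are assumptions for illustration only.

    #include <vector>

    // Hypothetical stream type; the stubs stand in for the framework's fixed-size buffer pipes.
    struct Buffer { std::vector<char> bytes; };
    struct Stream {
      Buffer read()               { return Buffer(); }   // consume one buffer (stub)
      void   write(const Buffer&) {}                     // produce one buffer (stub)
    };

    // Parameterized emulator filter: its cost per unit of work is controlled by
    // compute_iters (CPU intensive) and by the buffer sizes (I/O intensive).
    struct EmulatorFilter {
      std::vector<Stream*> ins, outs;
      long compute_iters = 1000000;      // knob controlling how CPU intensive the filter is

      void process_one() {
        std::vector<Buffer> received;
        for (Stream* s : ins) received.push_back(s->read());   // consume from all inputs

        volatile long sink = 0;
        for (long i = 0; i < compute_iters; ++i) sink = sink + i;  // compute (synthetic work)

        Buffer out;
        for (Stream* s : outs) s->write(out);                   // produce to all outputs
      }
    };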
Instances: Vary Number, Application

[Charts: response time (sec) vs. number of filter group instances (1 to 64), showing min/mean/max, for an I/O-intensive and a CPU-intensive emulated application.]
- Setup: UMD Red Linux cluster (2-processor PII-450 nodes)
- Point: the best number of instances depends on the application and the environment


Instances: Vary Number, Batch Size

[Chart: response time (sec) vs. batch size, with the optimal number of instances for each batch size shown in parentheses; min/mean/max.]
- Work is issued in instance batches until all of it completes
- Setup: (Optimal) is the number of instances with the lowest "all" time
- Point: the best number of instances depends on the application and the environment


Adding Heterogeneity

[Diagram: filter group instances (e.g., P3-F3-C3, P4-F4-C4, P7-F7-C7) placed across host1, host2 (2 CPUs each) and hostSMP (8 CPUs).]
- Matching the number of instances to the environment (CPU capacity)
Group Instances (Batches)

[Diagram: two instances of the P-F-C filter group (P0-F0-C0 and P1-F1-C1) across hosts; units of work are issued to the instances in batches (Batch 0, Batch 1, Batch 2) until all work completes.]


Optimization - Transparent Copies

- Replicate individual filters, transparently
  – preserves the FSP abstraction: provides a "single-stream" illusion to the other filters
- Issues:
  – multiple producers and consumers, deadlock, flow control
  – invariant: UOW i < UOW i+1 (unit-of-work ordering is preserved)
  – problem: filter state
Runtime Workload Balancing

- Adjust the number of transparent copies
  – work balance among copies
  – better tune resources to actual filter needs
- Use local information:
  – queue size, send time / receiver acks
- Demand-based dataflow (choice of consumer)
  – within a host – a perfect shared queue among copies
  – across hosts (a sketch of these policies follows below):
    - Round Robin
    - Weighted Round Robin
    - Demand-Driven sliding window (based on buffer consumption rate)
    - user-defined
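The across-host policies above choose which transparent copy receives the next buffer. The sketch below illustrates Round Robin, Weighted Round Robin, and a demand-driven choice based on an estimated per-copy consumption rate; the CopyStats fields and policy classes are invented for this example and are not the framework's actual scheduler.

    #include <cstddef>
    #include <vector>

    // Hypothetical per-copy bookkeeping kept by the sending filter.
    struct CopyStats {
      int         weight        = 1;    // static weight (e.g., relative CPU capacity of the host)
      std::size_t queued        = 0;    // buffers sent but not yet acknowledged
      double      consumed_rate = 0.0;  // buffers/sec acknowledged over a sliding window
    };

    // Round Robin: ignore load, cycle through the copies.
    struct RoundRobin {
      std::size_t next = 0;
      std::size_t pick(const std::vector<CopyStats>& c) { return next++ % c.size(); }
    };

    // Weighted Round Robin: each copy gets 'weight' consecutive buffers per cycle.
    struct WeightedRoundRobin {
      std::size_t copy = 0, sent_to_copy = 0;
      std::size_t pick(const std::vector<CopyStats>& c) {
        if (sent_to_copy >= (std::size_t)c[copy].weight) { copy = (copy + 1) % c.size(); sent_to_copy = 0; }
        ++sent_to_copy;
        return copy;
      }
    };

    // Demand-Driven: favor the copy that is draining its queue fastest relative to
    // what it already has queued (a stand-in for the sliding-window consumption rate).
    struct DemandDriven {
      std::size_t pick(const std::vector<CopyStats>& c) {
        std::size_t best = 0;
        double best_score = -1.0;
        for (std::size_t i = 0; i < c.size(); ++i) {
          double score = c[i].consumed_rate / (1.0 + (double)c[i].queued);
          if (score > best_score) { best_score = score; best = i; }
        }
        return best;
      }
    };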
Experiment – Virtual Microscope

- Client-server system for interactively visualizing digital slides
- Image dataset (100MB to 5GB per focal plane)
- Rectangular region queries, multiple data chunk reply
- Filter pipeline: read data (R) – decompress (D) – clip (C) – zoom (Z) – view (V)
- Hopkins Linux cluster – 4 1-processor and 1 2-processor PIII-800 nodes, 2 80GB IDE disks, 100Mbit Ethernet
- The decompress filter is the most expensive, so it is a good candidate for replication
- 50 queries at various magnifications, 512x512 pixel output


Virtual Microscope Results

Response time (seconds); each row gives the placement of the R-D-C-Z-V filters, with (n) marking the number of transparent copies of that filter:

  R-D-C-Z-V         | Average | 400x | 200x | 100x | 50x
  h-h-h-h-h         |  2.10   | 0.38 | 0.73 | 1.73 | 6.95
  h-g-g-g-g         |  1.49   | 0.37 | 0.62 | 1.27 | 4.60
  h-g(2)-g-g-g      |  1.15   | 0.39 | 0.50 | 0.95 | 3.41
  h-g(2)-g(2)-g-g   |  1.15   | 0.37 | 0.49 | 0.95 | 3.43
  h-g(4)-g(2)-g-g   |  1.17   | 0.39 | 0.50 | 0.96 | 3.50
  h-g(2)-g(2)-b-b   |  1.68   | 0.45 | 0.68 | 1.27 | 5.34
  g-g-g-g-g         |  1.44   | 0.33 | 0.58 | 1.26 | 4.46
  g-g(2)-g-g-g      |  1.08   | 0.33 | 0.45 | 0.92 | 3.24
Experiment - Isosurface Rendering

- UT Austin ParSSim species transport simulation
- Single time step visualization, read all data; isosurface value V = 0.35
- Setup: UMD Red Linux cluster (2-processor PII-450 nodes)
- Dataset: 1.5 GB on disk; inter-filter data volumes of 38.6 MB, 11.8 MB, and 28.5 MB
- Per-filter cost:

  Filter                        Time     Share
  R  (read dataset)             0.64s     4.3%
  E  (isosurface extraction)    1.64s    11.2%
  Ra (shade + rasterize)       11.67s    79.5%
  M  (merge / view)             0.73s     5.0%
  Total                        14.68s (sum); 12.65s (measured time)


Sample Isosurface Visualization

[Figure: an example isosurface rendering of the ParSSim dataset.]
Transparent Copies: Replicate Raster

- Setup: SPMD style, partitioned input dataset per node; isosurface value V = 0.7

  Copies of Raster          1 node    2 nodes   4 nodes   8 nodes
  1 copy                    12.18s     7.32s     4.17s     3.00s
  2 copies                   8.16s     5.70s     3.88s     3.24s
  (improvement over 1 copy)  (33%)     (22%)      (7%)     (–8%)

- Point: copies of the bottleneck filter are enough to balance the flow
Experiment – Resource Heterogeneity

- Isosurface rendering on the Red, Blue, and Rogue Linux clusters at Maryland
  – Red – 16 2-processor PII-450 nodes, 256MB memory, 18GB SCSI disk
  – Blue – 8 2-processor PIII-550 nodes, 1GB, 2 8GB SCSI disks, plus 1 8-processor PIII-450 node, 4GB, 2 18GB SCSI disks
  – Rogue – 8 1-processor PIII-650 nodes, 128MB, 2 75GB IDE disks
  – Red and Blue connected via Gigabit Ethernet, Rogue via 100Mbit Ethernet
- Two implementations of the Raster filter – z-buffer and active pixels (the latter used in the previous experiments)
Experimental setup

- Per-filter cost of the two Raster implementations:

  Filter                        Active Pixel   Z-buffer
  R  (read dataset)                0.64s         0.68s
  E  (isosurface extraction)       1.64s         1.65s
  Ra (shade + rasterize)          11.67s         9.43s
  M  (merge / view)                0.73s         0.90s
  Total (sum)                     14.68s        12.66s

- Dataset: 1.5 GB; inter-filter data volumes of 38.6 MB, 11.8 MB, and 28.5 MB with the active-pixel Raster (32.0 MB with z-buffer)
- Only Red nodes are used – each runs one combined RE filter and 0, 1, or 2 Ra copies, and one node also runs M
- The experiments that follow combine the R and E filters, since that combination showed the best performance in experiments not shown here


Varying number of nodes

[Table: response times (seconds) for Z-buffer and Active Pixel rendering on 1, 2, 4, and 8 nodes, for the RE-Ra-M and RERa-M filter configurations with varying numbers of Ra copies, under the RR, WRR, and DD buffer-distribution policies.]
Experimental setup

- Heterogeneous nodes
  – Active Pixel algorithm on the 8-processor Blue node + Red data nodes
  – the Blue node runs 7 Ra or ERa copies and M; the Red data nodes each run one of each filter except M
- Skewed data distribution
  – 25GB dataset from UT ParSSim (bigger grid than in the earlier experiments)
  – Hilbert curve declustering onto the disks of 2 Blue and 2 Rogue nodes (a declustering sketch follows below)
  – skew moves part of the data from the Blue to the Rogue nodes


Heterogeneous nodes

[Table: execution times (seconds) with 1, 2, 4, and 8 data nodes for the RERa-M and RE-Ra-M configurations under the RR, WRR, and DD policies.]
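The Hilbert-curve declustering mentioned above orders the dataset's chunks along a space-filling curve and deals them out to disks, so spatially nearby chunks tend to land on different disks. The sketch below uses the standard 2-D Hilbert index computation with a round-robin assignment; the Chunk and decluster names are simplified stand-ins for illustration, not the actual declustering code.

    #include <algorithm>
    #include <utility>
    #include <vector>

    // Rotate/flip a quadrant so the Hilbert curve orientation is correct at this level.
    static void rot(int n, int& x, int& y, int rx, int ry) {
      if (ry == 0) {
        if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
        std::swap(x, y);
      }
    }

    // Map 2-D chunk coordinates (x, y) on an n x n grid (n a power of two)
    // to a 1-D position along the Hilbert curve.
    static long hilbert_index(int n, int x, int y) {
      long d = 0;
      for (int s = n / 2; s > 0; s /= 2) {
        int rx = (x & s) > 0;
        int ry = (y & s) > 0;
        d += (long)s * s * ((3 * rx) ^ ry);
        rot(n, x, y, rx, ry);
      }
      return d;
    }

    // Decluster: sort chunks by Hilbert index, then deal them round-robin to disks,
    // so spatially adjacent chunks tend to end up on different disks/nodes.
    struct Chunk { int x, y; };
    std::vector<int> decluster(const std::vector<Chunk>& chunks, int grid_n, int num_disks) {
      std::vector<std::pair<long, int>> order;               // (hilbert index, chunk id)
      for (int i = 0; i < (int)chunks.size(); ++i)
        order.push_back({hilbert_index(grid_n, chunks[i].x, chunks[i].y), i});
      std::sort(order.begin(), order.end());

      std::vector<int> disk_of(chunks.size());
      for (int pos = 0; pos < (int)order.size(); ++pos)
        disk_of[order[pos].second] = pos % num_disks;        // round-robin along the curve
      return disk_of;
    }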
Skewed data distribution

[Charts: Active Pixel rendering execution time (seconds) for the balanced distribution and for 25%, 50%, and 75% skew, comparing the RR, WRR, and DD policies across the R-ERa-M, RE-Ra-M, and RERa-M filter configurations.]


Experimental Implications

- Most filter decompositions are unbalanced
  – heterogeneous CPU capacity / load
  – transparent copies can help
  – but … requires buffer out-of-order tolerance
- Multiple group instances
  – higher utilization of under-used resources
  – reduced application response time
  – but … requires work co-execution tolerance
Ongoing and Future Work

- Automated placement, instances, transparent copies
  – predictive (cost models)
  – adaptive (work feedback)
- Filter accumulator support (partitioning, replication)
- Java filters
- CCA-compliant filters
- Very large datasets – including HDF5 format
  – using storage clusters at UMD and OSU, then a testbed at LLNL