Performance Optimization of Component-based Data Intensive Applications

Alan Sussman, Michael Beynon, Tahsin Kurc, Umit Catalyurek, Joel Saltz
University of Maryland
http://www.cs.umd.edu/projects/adr

Outline
- Motivation
- Approach
- Optimization – Group Instances
- Optimization – Transparent Copies
- Ongoing Work

Targeted Applications
- Pathology
- Volume Rendering
- Surface Groundwater Modeling
- Satellite Data Analysis

Runtime Environment
- Heterogeneous shared resources
  - Host level: machine, CPUs, memory, disk storage
  - Network connectivity
- Many remote datasets
  - Inexpensive archival storage clusters (1 TB ~ $10k)
  - Islands of useful data
  - Too large for replication

DataCutter
[Architecture diagram: clients issue range queries through the Client Interface Service; the Indexing Service returns segment info, the Data Access Service reads segments (file, offset, size) from the Archival Storage System, and the Filtering Service runs filters that deliver segment data back to the client.]
- Indexing Service
  - Multilevel hierarchical indexes based on spatial indexing methods (e.g., R-trees)
  - Relies on an underlying multi-dimensional space
  - User can add new indexing methods
- Filtering Service
  - Distributed C++ component framework
  - Transparent tuning and adaptation for heterogeneity
  - Filters implemented as threads – 1 process per host
- Versions of both services integrated into SDSC SRB

Indexing – Subsetting
- Datasets are partitioned into segments
  - Used to index the dataset; unit of retrieval
  - Spatial indexes built from the bounding boxes of all elements in a segment
- Indexing very large datasets
  - Multi-level hierarchical indexing scheme
  - Summary index files – for a collection of segments or detailed index files
  - Detailed index files – to index the individual segments
[Diagram: raw dataset partitioned into indexed segments.]

Filter-Stream Programming (FSP)
- Purpose: specialized components for processing data
  - Based on Active Disks research [Acharya, Uysal, Saltz: ASPLOS'98], dataflow, functional parallelism, message passing
- Filters – logical unit of computation
  - High-level tasks
  - init, process, finalize interface
- Streams – how filters communicate
  - Unidirectional buffer pipes
  - Uses fixed-size buffers (min, good)
- Manually specify filter connectivity and filter-level characteristics
[Example filter group: Extract raw and Extract ref (against a Reference DB) feeding 3D reconstruction and View result.]

FSP: Abstractions
- Filter group
  - Logical collection of filters to use together
  - Application starts filter group instances
- Unit-of-work cycle
  - "Work" is application defined (e.g., a query)
  - Work is appended to running instances
  - init(), process(), finalize() called for each work (sketched below)
  - process() returns { EndOfWork | EndOfFilter }
  - Allows for adaptivity
[Diagram: filters S, A, and B connected by buffer streams, with units of work uow 0, uow 1, uow 2 in flight.]
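The filter interface described above (init/process/finalize over unidirectional, fixed-size buffer streams) can be pictured with a short C++ sketch. This is illustrative only, not the real DataCutter headers: Buffer, Stream, FilterStatus, and ExtractFilter are hypothetical stand-ins, and the stream is modeled as a simple in-memory queue.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical names throughout; this mirrors the filter interface described
// on the slides, not the actual DataCutter API.
struct Buffer { std::vector<unsigned char> data; };

// A unidirectional buffer pipe with a fixed capacity, modeled in-memory here.
class Stream {
public:
    explicit Stream(std::size_t capacity = 16) : capacity_(capacity) {}
    bool write(const Buffer& b) {              // false if the pipe is full
        if (q_.size() >= capacity_) return false;
        q_.push_back(b);
        return true;
    }
    bool read(Buffer& b) {                     // false when the pipe is empty
        if (q_.empty()) return false;
        b = q_.front();
        q_.pop_front();
        return true;
    }
private:
    std::deque<Buffer> q_;
    std::size_t capacity_;
};

enum class FilterStatus { EndOfWork, EndOfFilter };

// One logical unit of computation. init/process/finalize are called for each
// unit of work appended to a running filter group instance.
class ExtractFilter {
public:
    void init() {}                             // per-work setup (e.g., the query)
    FilterStatus process(Stream& in, Stream& out) {
        Buffer b;
        while (in.read(b)) {                   // consume from the input stream
            // ... transform b here ...
            out.write(b);                      // produce to the downstream filter
        }
        return FilterStatus::EndOfWork;        // or EndOfFilter when shutting down
    }
    void finalize() {}                         // per-work teardown
};
```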
Placement
- The dynamic assignment of filters to particular hosts for execution is placement (or mapping)
- Optimization criteria
  - Communication
    - Leverage filter affinity to the dataset
    - Minimize communication volume on slower connections
    - Co-locate filters with large communication volume
  - Computation
    - Expensive computation on faster, less loaded hosts

Optimization – Group Instances
- Work issued in instance batches until all complete
- Match # instances to environment (CPU capacity, network)
[Diagram: filter group instances P0-F0-C0 and P1-F1-C1 mapped across host1, host2, and host3 (2 CPUs each).]

Group Instances (Batches)
[Diagram: units of work issued in batches (Batch 0, Batch 1, Batch 2) to filter group instances across the hosts.]

Experiment – Application Emulator
- Parameterized dataflow filter: consume from all inputs, compute, produce to all outputs
- Emulated application: producer (P) → filter (F) → consumer (C)
- Work: 64 units of work, single batch
[Diagram: filter internals – in, process, out.]

Instances: Vary Number, Application
- Setup: UMD Red Linux cluster (2-processor PII-450 nodes)
- Point: # instances depends on application and environment
[Charts: response time (sec), min/mean/max, vs. number of filter group instances (1–64) for I/O-intensive and CPU-intensive emulated applications.]

Instances: Vary Number, Batch Size
- Matching # instances to environment (CPU capacity)
- Setup: (optimal) = the number of instances giving the lowest time overall
- Point: # instances depends on application and environment
[Chart: response time (sec), min/mean/max, for the CPU-intensive application vs. batch size (optimal number of instances): 1 (8), 2 (32), 4 (8), 8 (4), 16 (4), 32 (2), 64 (1).]

Adding Heterogeneity
[Diagram: filter group instances P0-F0-C0 through P7-F7-C7 spread across host1, host2, host3 (2 CPUs each) and hostSMP (8 CPUs).]

Optimization – Transparent Copies
- Replicate individual filters
  - Transparent
  - Work-balance among copies
  - Better tune resources to actual filter needs
- FSP abstraction: provide a "single-stream" illusion
  - Within a host: perfect shared queue among copies
  - Across hosts:
    - Round Robin
    - Weighted Round Robin
    - Demand-Driven sliding window (on buffer consumption rate)
    - User-defined
- Multiple producers and consumers, deadlock, flow control
- Invariant: UOW i < UOW i+1 (unit-of-work ordering preserved)
- Problem: filter state
[Diagram: pipeline R → E → Ra0 … Rak → M, with transparent copies Ra0 through Rak of the Raster filter.]

Runtime Workload Balancing
- Use local information: queue size, send time / receiver acks
- Adjust the number of transparent copies
- Demand-based dataflow (choice of consumer) – see the sketch below
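A minimal sketch of the consumer-choice policies named above (Round Robin, Weighted Round Robin, Demand-Driven), assuming each copy's unacknowledged-buffer count and a per-host weight are the only local information available. CopyState and the function names are hypothetical, not DataCutter APIs; the demand-driven case approximates the sliding window on buffer consumption rate by the shortest pending queue.

```cpp
#include <cstddef>
#include <vector>

// Illustrative bookkeeping for one transparent copy of a filter.
struct CopyState {
    std::size_t queued = 0;   // buffers sent but not yet acknowledged
    double weight = 1.0;      // relative capacity of the host running this copy
    double credit = 0.0;      // accumulated share for weighted round robin
};

// Plain round robin: rotate through the transparent copies.
std::size_t round_robin(std::size_t& next, std::size_t ncopies) {
    std::size_t pick = next;
    next = (next + 1) % ncopies;
    return pick;
}

// Weighted round robin: each copy accumulates credit proportional to its
// weight; the copy with the most credit gets the next buffer and is charged
// one full round of weight.
std::size_t weighted_round_robin(std::vector<CopyState>& copies) {
    double total = 0.0;
    std::size_t pick = 0;
    for (std::size_t i = 0; i < copies.size(); ++i) {
        copies[i].credit += copies[i].weight;
        total += copies[i].weight;
        if (copies[i].credit > copies[pick].credit) pick = i;
    }
    copies[pick].credit -= total;
    return pick;
}

// Demand-driven: send to the copy draining its queue fastest, approximated
// here by the smallest number of unacknowledged buffers.
std::size_t demand_driven(const std::vector<CopyState>& copies) {
    std::size_t pick = 0;
    for (std::size_t i = 1; i < copies.size(); ++i)
        if (copies[i].queued < copies[pick].queued) pick = i;
    return pick;
}
```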
Experiment – Virtual Microscope
- Client-server system for interactively visualizing digital slides
- Image dataset (100 MB to 5 GB per focal plane)
- Rectangular region queries, multiple data chunk reply
- Hopkins Linux cluster – 4 1-processor and 1 2-processor PIII-800 nodes, 2 80 GB IDE disks, 100 Mbit Ethernet
- Filter pipeline: read data → decompress → clip → zoom → view

Virtual Microscope Results
- Decompress filter is the most expensive, so a good candidate for replication
- 50 queries at various magnifications, 512x512 pixel output
- Response time (seconds); R-D-C-Z-V gives the placement of the read/decompress/clip/zoom/view filters, with parenthesized counts indicating transparent copies:

  R-D-C-Z-V          400x   200x   100x    50x   Average
  h-h-h-h-h          0.38   0.73   1.73   6.95   2.10
  h-g-g-g-g          0.37   0.62   1.27   4.60   1.49
  h-g(2)-g-g-g       0.39   0.50   0.95   3.41   1.15
  h-g(2)-g(2)-g-g    0.37   0.49   0.95   3.43   1.15
  h-g(4)-g(2)-g-g    0.39   0.50   0.96   3.50   1.17
  h-g(2)-g(2)-b-b    0.45   0.68   1.27   5.34   1.68
  g-g-g-g-g          0.33   0.58   1.26   4.46   1.44
  g-g(2)-g-g-g       0.33   0.45   0.92   3.24   1.08

Experiment – Isosurface Rendering
- UT Austin ParSSim species transport simulation
- Single time step visualization, read all data
- Setup: UMD Red Linux cluster (2-processor PII-450 nodes)
- Pipeline and per-filter costs (1.5 GB dataset on disk):

  R   read dataset            output 38.6 MB    0.64 s    4.3%
  E   isosurface extraction   output 11.8 MB    1.64 s   11.2%
  Ra  shade + rasterize       output 28.5 MB   11.67 s   79.5%
  M   merge / view                              0.73 s    5.0%
  Sum = 14.68 s; time = 12.65 s

Sample Isosurface Visualization
[Images: isosurfaces for V = 0.35 and V = 0.7.]

Transparent Copies: Replicate Raster
- Setup: SPMD style, partitioned input dataset per node
- Point: copies of the bottleneck filter are enough to balance the flow

                       1 node          2 nodes         4 nodes        8 nodes
  1 copy of Raster     12.18 s         7.32 s          4.17 s         3.00 s
  2 copies of Raster   8.16 s (33%)    5.70 s (22%)    3.88 s (7%)    3.24 s (–8%)

Experiment – Resource Heterogeneity
- Isosurface rendering on the Red, Blue, and Rogue Linux clusters at Maryland
  - Red: 16 2-processor PII-450 nodes, 256 MB memory, 18 GB SCSI disk
  - Blue: 8 2-processor PIII-550 nodes (1 GB, two 8 GB SCSI disks) + 1 8-processor PIII-450 node (4 GB, two 18 GB SCSI disks)
  - Rogue: 8 1-processor PIII-650 nodes, 128 MB, two 75 GB IDE disks
  - Red and Blue connected via Gigabit Ethernet, Rogue via 100 Mbit Ethernet
- Two implementations of the Raster filter: z-buffer and active pixels (the one used in the previous experiment)

Experimental Setup
- Per-filter costs for the two Raster implementations (1.5 GB dataset; stream sizes 38.6 MB after R and 11.8 MB after E; Ra output 28.5 MB for Active Pixel vs. 32.0 MB for Z-buffer):

                              Active Pixel   Z-buffer
  R   read dataset            0.64 s         0.68 s
  E   isosurface extraction   1.64 s         1.65 s
  Ra  shade + rasterize       11.67 s        9.43 s
  M   merge / view            0.73 s         0.90 s
  Sum                         14.68 s        12.66 s

Varying Number of Nodes
[Tables: response times (seconds) for Active Pixel and Z-buffer rendering with the RE-Ra-M and RERa-M configurations and 0, 1, or 2 Ra copies per node, on 1, 2, 4, and 8 nodes and on heterogeneous nodes, under Round Robin (RR), Weighted Round Robin (WRR), and Demand-Driven (DD) scheduling.]

Experimental Setup – Configurations
- Only Red nodes used: each one runs 1 RE and 0, 1, or 2 Ra copies, and one node runs M
- The experiment that follows combines the R and E filters, since that showed the best performance in experiments not shown (see the sketch below for the configuration shorthand)
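To make the configuration shorthand concrete (for example, RE-Ra-M with two transparent copies of Ra), here is a hypothetical way to write such a layout down in C++. FilterSpec, FilterGroup, and make_re_ra_m are illustrative names only, not part of DataCutter.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative only: a way to record configurations such as "RE-Ra-M with
// two transparent copies of Ra". Not a DataCutter API.
struct FilterSpec {
    std::string name;     // "RE" (combined read + extract), "Ra", "M", ...
    std::size_t copies;   // number of transparent copies of this filter
};

using FilterGroup = std::vector<FilterSpec>;   // filters listed in pipeline order

// RE-Ra-M: combined read+extract, then rasterize with `ra_copies` transparent
// copies, then a single merge/view filter.
FilterGroup make_re_ra_m(std::size_t ra_copies) {
    return {
        {"RE", 1},
        {"Ra", ra_copies},
        {"M",  1},
    };
}
```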
Skewed Data Distribution
- 25 GB dataset from UT ParSSim (bigger grid than in the earlier experiments)
- Hilbert curve declustering onto the disks of 2 Blue and 2 Rogue nodes
- Skew moves part of the data from Blue to Rogue nodes
- Active Pixel algorithm on the 8-processor Blue node + Red data nodes
- The Blue node runs 7 Ra or ERa copies and M; the Red nodes each run 1 of each except M
[Charts: execution time (seconds) for Balanced, Skewed 25%, Skewed 50%, and Skewed 75% Active Pixel rendering, with 1, 2, 4, and 8 data nodes; filter configurations RERa-M, R-E Ra-M, and RE-Ra-M under RR, WRR, and DD scheduling.]

Experimental Implications
- Multiple group instances
  - Higher utilization of under-used resources
  - Reduce application response time
  - but … requires work co-execution tolerance
- Transparent copies can help
  - Most filter decompositions are unbalanced
  - Heterogeneous CPU capacity / load
  - but … requires buffer out-of-order tolerance

Ongoing and Future Work
- Automated placement, instances, transparent copies
  - Predictive (cost models)
  - Adaptive (work feedback)
- Filter accumulator support (partitioning, replication)
- Java filters
- CCA-compliant filters
- Very large datasets, including HDF5 format
  - Using storage clusters at UMD and OSU, then a testbed at LLNL