Predicting the Impact of Configuration Changes
Jeff Hollingsworth and Hyeonsang Eom
University of Maryland

Slide 2: A Family of Simulators
• Explore the accuracy vs. time trade-off
  – use simple static estimation of I/O and communication
  – exploring adding stochastic variation
• Simplifying assumptions
  – no network link contention
  – predictable computation/communication interference
  – infinite memory

Slide 3: DumbSim
• Very fast, optimistic simulator
  – assumes perfect overlap of I/O and computation
  – ignores block producer-consumer relationships
• Epochs used for intra-node synchronization
• Embarrassingly parallel
[Figure: per-node timelines (disk time, CPU time, communication time) for Node 1 and Node 2; the maximum across nodes is taken at each epoch boundary.]

Slide 4: FastSim: Fast Simulator
• Flexible event-processing loop
  – round-robin: process the next event for each node
    • most accurate when load is balanced
  – discrete event: find the earliest next event
    • more overhead than round-robin
• Uses a graph to update the timing for each resource
[Figure: resource graph for Node 1 and Node 2 in which disk, communication, and CPU times are combined through MAX nodes.]

Slide 5: Titan Emulator (SDSC machine)
[Charts: predicted time (secs) and simulation time (msecs, log scale) vs. number of processors simulated (16, 32, 64, 100) for scaled and non-scaled inputs; series: IBM SP2 (measured), Dumbsim, Fastsim, Gigasim, Petasim.]

Slide 6: Pathfinder Emulator (SDSC machine)
[Charts: predicted time and simulation time vs. number of processors (16, 32, 64, 100) for the scaled input, and vs. the I/O-node to compute-node ratio (4/60 through 60/4); series: IBM SP2 (measured), Dumbsim, Fastsim, Gigasim, Petasim.]

Slide 7: Virtual Microscope (SDSC machine)
[Charts: predicted time and simulation time vs. number of processors simulated (16, 32, 64, 100); series: IBM SP2 (measured), Dumbsim, Fastsim, Gigasim, Petasim.]

Slide 8: Scaling Up the Number of Nodes
[Charts: predicted time and simulation time for 256 to 8192 nodes, for the Virtual Microscope and Pathfinder application emulators; series: Dumbsim, Fastsim, Gigasim.]

Slide 9: Summary of I/O Results
• Application emulators
  – can generate complex I/O patterns quickly
  – enable efficient simulation of large systems
• Family of simulators
  – permits cross-checking of results
  – allows trading simulation speed against accuracy

Slide 10: Critical Path Profiling
• Critical path
  – the longest path through a parallel program
  – to speed up the program, this path must be shortened
• Critical path profile
  – the time each procedure spends on the critical path
• CP zeroing
  – compute the CP as if a procedure's time were 0 (sketched below)
  – use a variation of the online CP algorithm
    • CPnet = CP - Share
    • at a receive, keep the tuple with the largest CPnet
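The slides state the CP-zeroing idea without showing a computation, so here is a minimal offline sketch; it is not the online, message-piggybacked algorithm (CPnet = CP - Share) the slides refer to. The program activity graph encoding, function name, and example times below are hypothetical.

```python
from collections import defaultdict

def critical_path_length(edges, zero_proc=None):
    """Longest path through a program activity graph (a DAG).
    edges: list of (src, dst, duration, procedure).  If zero_proc is given,
    that procedure's time is treated as 0 (CP zeroing)."""
    succ = defaultdict(list)
    nodes = set()
    for src, dst, dur, proc in edges:
        if proc == zero_proc:
            dur = 0.0                      # CP zeroing: pretend this procedure is free
        succ[src].append((dst, dur))
        nodes.update((src, dst))

    memo = {}                              # longest path starting at each node
    def longest_from(n):
        if n not in memo:
            memo[n] = max((d + longest_from(m) for m, d in succ[n]), default=0.0)
        return memo[n]

    return max(longest_from(n) for n in nodes)

# Hypothetical PAG loosely modeled on slide 11; the times are made up.
pag = [("initial", "a", 4.0, "call(a)"), ("initial", "b", 1.0, "call(b)"),
       ("a", "final", 2.0, "call(c)"),    ("b", "final", 1.0, "call(c)")]
base = critical_path_length(pag)
for proc in ("call(a)", "call(b)", "call(c)"):
    print(f"{proc}: CP drops from {base} to {critical_path_length(pag, proc)} when zeroed")
```

Ranking procedures by how much the critical path shrinks when each is zeroed is what lets a routine such as create_seq on slide 12 stand out even though its CPU time is small.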
Slide 11: Program Activity Graph
[Figure: an example program activity graph from the initial node to the final node, with vertices for procedure calls (call(a), call(b), call(c)) and message events (startSend/endSend, startRecv/endRecv), and edges weighted by elapsed time.]

Slide 12: NAS IS Application

  Procedure    CP    %CP   CPU   %CPU
  nas_is_ben   12.4  56.4  54.8  74.1
  create_seq    9.2  42.0   9.2  12.4
  do_rank       0.4   1.6   9.2  12.5

• create_seq is more important than its CPU time indicates.
• do_rank is ranked higher than create_seq by CPU time.

Slide 13: Load Balancing Factor
• Key idea: what if we move work?
  – the length of each activity remains the same
  – where the computation is performed changes
• Two granularities possible
  – process level
    • process placement or migration
  – procedure level
    • function shipping
    • fine-grained thread migration

Slide 14: Process LBF
• What if we change the processor assignment?
  – predict execution time on larger configurations
  – try out different allocations
• Issues
  – changes in communication cost
    • local vs. non-local communication
  – interaction with the scheduling policy
    • how are nodes shared?
    • assume round robin

Slide 15: Computing the Load Balancing Factor
[Figure: a program activity graph (calls a, b, and c with sends and receives across processes P1, P2, and P3) and the corresponding group activity graph after the activities are reassigned to groups G1 and G2.]

Slide 16: Using Paradyn to Implement Process LBF
• Forward data from the application to the monitor
  – need to forward events to a central point
  – supports samples
  – requires extensions to the data-collection system
• Provides dynamic control of data collection
  – only piggyback instrumentation on demand
• Need to correlate data from different nodes
  – use the $globalId MDL variable

Slide 17: Results: Accuracy
[Bar chart: time (sec) for EP, FT, IS, and MG; series: measured time on 16 processors, time for 16 processors predicted from a 16-processor run, and time for 16 processors predicted from an 8-processor run.]

Slide 18: LBF Overhead (16 nodes)
[Bar chart: time (sec) for EP, FT, IS, and MG, measured with and without instrumentation.]

Slide 19: Changing Network and Processes
• Change: number of nodes (8 -> 16) and network (10 Mbps Ethernet -> 320 Mbps HPS)
[Bar chart: time (sec) for EP, FT, IS, and MG; measured on 16 processors with HPS vs. predicted from a run on 8 processors with Ethernet.]

Slide 20: Linger Longer
• Many idle cycles on workstations
  – even when users are active, most of the processing power is not used
• Idea: fine-grained cycle stealing
  – run processes at a very low priority
  – migration becomes an optimization, not a necessity
• Issues
  – how long to linger?
  – how much disruption to foreground users?
    • delay of local jobs: process switching
    • virtual memory interactions

Slide 21: Simulation of Policies
• Model the workstation as
  – a foreground process (high priority)
    • requests the CPU, then blocks
    • a hybrid of trace-based data and a model
  – a background process (low priority)
    • always ready to run, with a fixed CPU time
  – context switches (100 microseconds each)
    • accounts for both direct state and cache reload
• Study (see the sketch below)
  – what is the benefit of lingering?
  – how much will lingering slow foreground processes?
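As a rough, self-contained illustration of the workstation model on slide 21 (not the trace-driven simulator used in the study), the sketch below runs a high-priority foreground process that alternates CPU bursts and blocked periods against a low-priority background job with a fixed amount of work, charging 100 microseconds per context switch. The burst and idle distributions and all other constants are made-up assumptions.

```python
import random

CTX_SWITCH = 100e-6   # 100 microseconds per switch (direct state + cache reload)

def simulate(bg_work, fg_bursts, ctx=CTX_SWITCH):
    """Return (background completion time, extra delay seen by the foreground job).
    fg_bursts: list of (cpu_burst_secs, blocked_secs) pairs for the foreground job."""
    t, fg_delay, remaining = 0.0, 0.0, bg_work
    for burst, blocked in fg_bursts:
        if remaining > 0:
            t += ctx                  # switch the background job out; charge this
            fg_delay += ctx           # to the foreground job (a modeling choice)
        t += burst                    # foreground runs at high priority
        if remaining > 0:
            t += ctx                  # switch the background job back in
        ran = min(blocked, remaining) # background lingers while foreground blocks
        remaining -= ran
        if remaining <= 0:
            return t + ran, fg_delay  # background finishes mid idle period
        t += blocked
    return t + remaining, fg_delay    # trace exhausted; finish uncontended

random.seed(0)
bursts = [(random.uniform(0.01, 0.05), random.uniform(0.05, 0.20)) for _ in range(10_000)]
finish, delay = simulate(bg_work=60.0, fg_bursts=bursts)
print(f"background done at {finish:.1f}s; foreground delayed {delay * 1e3:.1f} ms")
```

Running many synthetic traces like this yields the two quantities the slide asks about: how long the lingering background job takes to finish and how much extra delay the foreground user sees.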
Slide 22: Migration Policies (a small policy-decision sketch follows slide 25)
• Immediate Eviction (IE)
  – when a user returns, migrate the job
  – the policy used by Berkeley NOW
  – assumes a free workstation, or no penalty to stop the job
• Pause and Migrate (PM)
  – when a user returns, pause the job and then migrate it
  – used by Wisconsin's Condor
• Linger Longer (LL)
  – when a user returns, decrease the job's priority and remain
    • monitor the situation to decide when to migrate
  – permits fine-grained cycle stealing
• Linger Forever (LF)
  – like Linger Longer, but never migrate

Slide 23: Simulation Results: Sequential Workload
• LF is fastest, but its variation is higher than LL's
• LL and LF have lower variation than IE or PM
• Slowdown for foreground jobs is under 1%
• LF is a 60% improvement over the PM policy
[Charts: average time (sec), broken down into migration, linger, run, and in-queue time, for the LL, LF, IE, and PM policies.]

Slide 24: Simulation Results: Parallel Applications
• Uses DSM applications on non-idle workstations
• Assumes a 1.0 Gbps LAN
• Compares lingering vs. reconfiguration
• Lingering is often faster than reconfiguration!
[Charts: time (billions of cycles) vs. number of idle nodes (16 down to 0) for fft and water on 16 nodes, with 20% local usage on the non-idle nodes, comparing the linger and reconfiguration strategies.]

Slide 25: Future Directions
• Wide-area test configuration
  – simulate a high-latency / high-bandwidth network
  – a controlled testbed for wide-area computing
• Parallel computing on non-dedicated clusters
  – current simulations show promise, but ...
    • need to include data about the memory hierarchy
    • the real test is to build the system
• Development of the Metric and Option Interface
  – prototype applications that can adapt to change
  – evaluate different adaptation policies
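To tie together the four policies on slide 22, here is a hypothetical decision sketch of what a guest job might do when the workstation's owner returns. Only the policy names come from the slides; the usage metric, threshold, and returned actions are illustrative assumptions, not the rules used in the simulations.

```python
from enum import Enum, auto

class Policy(Enum):
    IE = auto()   # Immediate Eviction (Berkeley NOW)
    PM = auto()   # Pause and Migrate (Wisconsin Condor)
    LL = auto()   # Linger Longer
    LF = auto()   # Linger Forever

def on_user_return(policy, local_usage, usage_threshold=0.20):
    """Action for the guest job when the owner becomes active again.
    local_usage and usage_threshold are assumed metrics, not from the slides."""
    if policy is Policy.IE:
        return "migrate the job now"
    if policy is Policy.PM:
        return "pause the job, then migrate it"
    # Linger policies: drop to the lowest priority and keep stealing idle cycles.
    if policy is Policy.LF or local_usage < usage_threshold:
        return "lower priority and keep lingering"
    return "lower priority, monitor usage, and migrate if it stays high"

print(on_user_return(Policy.LL, local_usage=0.35))
```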