Predicting the Impact of Configuration Changes

Jeff Hollingsworth
Hyeonsang Eom
University of Maryland
A Family of Simulators
• Explore accuracy vs. time trade-off
  – Use simple static estimation of I/O and communication
  – Exploring adding stochastic variation
• Simplifying assumptions
  – no network link contention
  – predictable computation/communication interference
  – infinite memory
DumbSim
• Very Fast, Optimistic Simulator
  – assumes perfect overlap of I/O and computation (see the sketch below)
  – ignores block producer-consumer relationships
• Epochs used for intra-node synchronization
• Simulation is embarrassingly parallel
[Figure: per-epoch timelines for Node 1 and Node 2, each showing disk time, CPU time, and communication time; at each epoch boundary a node advances by the maximum of the three.]
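To make the optimistic estimate concrete: with perfect overlap, a node's cost for an epoch is simply the maximum of its CPU, disk, and communication times, and nodes can be totaled independently. A minimal Python sketch under that reading (the function name and epoch layout are invented for illustration, not DumbSim's actual code):

def dumbsim_estimate(epochs):
    """epochs: list of dicts mapping node id -> (cpu, disk, comm) seconds."""
    node_totals = {}
    for epoch in epochs:
        for node, (cpu, disk, comm) in epoch.items():
            # perfect overlap: an epoch costs only as much as its slowest resource
            node_totals[node] = node_totals.get(node, 0.0) + max(cpu, disk, comm)
    # the predicted run time is set by the slowest node
    return max(node_totals.values())

# example: two nodes, two epochs
print(dumbsim_estimate([
    {1: (4.0, 7.0, 1.0), 2: (5.0, 2.0, 1.0)},
    {1: (3.0, 1.0, 2.0), 2: (2.0, 2.0, 1.0)},
]))  # -> 10.0 (node 1 dominates: 7.0 + 3.0)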
FastSim: Fast Simulator
• Flexible event-processing loop (see the sketch below)
  – round-robin: process the next event for each node
    • most accurate when load is balanced
  – discrete event: find the earliest time of the next event
    • more overhead than round-robin
• Uses a graph to update the timing for each resource
[Figure: resource-timing graph; the disk time, CPU time, and communication time of Node 1 and Node 2 feed into MAX nodes.]
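A hedged sketch of the two event-processing strategies (the node interface — has_events, next_event_time, process_next_event — is invented for illustration, not FastSim's actual API):

import heapq

def round_robin(nodes):
    # process the next pending event of each node in turn;
    # cheap, and accurate when the nodes stay roughly in step
    while any(n.has_events() for n in nodes):
        for n in nodes:
            if n.has_events():
                n.process_next_event()

def discrete_event(nodes):
    # always advance the node whose next event has the earliest timestamp;
    # more bookkeeping (a priority queue), but robust to load imbalance
    heap = [(n.next_event_time(), i) for i, n in enumerate(nodes) if n.has_events()]
    heapq.heapify(heap)
    while heap:
        _, i = heapq.heappop(heap)
        nodes[i].process_next_event()
        if nodes[i].has_events():
            heapq.heappush(heap, (nodes[i].next_event_time(), i))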
Titan Emulator (SDSC Machine)
[Charts: predicted time (secs) and simulation time (msecs) for scaled and non-scaled input on 16, 32, 64, and 100 processors; series are IBM SP2 (measured), Dumbsim, Fastsim, Gigasim, and Petasim.]
Pathfinder Emulator (SDSC Machine)
[Charts: predicted time (secs) and simulation time (msecs) for scaled input on 16, 32, 64, and 100 processors, and for I/O-to-compute node ratios from 4/60 to 60/4; series are IBM SP2 (measured), Dumbsim, Fastsim, Gigasim, and Petasim.]
Virtual Microscope (SDSC Machine)
[Charts: predicted time (secs) and simulation time (msecs) vs. the number of processors simulated (16, 32, 64, 100); series are IBM SP2 (measured), Dumbsim, Fastsim, Gigasim, and Petasim.]
Scaling up the number of Nodes
[Charts: Virtual Microscope and Pathfinder application emulators; predicted time (secs) and simulation time (secs) for 256 to 8192 nodes; series are Dumbsim, Fastsim, and Gigasim.]
Summary of I/O Results
• Application Emulators
  – can generate complex I/O patterns quickly.
  – enable efficient simulation of large systems.
• Family of Simulators
  – permits cross-checking of results.
  – allows trading off simulation speed and accuracy.
Critical Path Profiling
• Critical Path
  – the longest path through a parallel program
  – to speed up the program, this path must be shortened
• Critical Path Profile
  – the time each procedure is on the critical path
• CP Zeroing (sketched below)
  – compute the CP as if a procedure's time were 0
  – use a variation of the online CP algorithm
    • CPnet = CP - Share
    • at a receive, keep the tuple with the largest CPnet
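A minimal sketch of the CP-zeroing bookkeeping described above, assuming each process carries a (CP, Share) tuple on its messages, where Share is the time charged to the procedure being zeroed (names and layout are illustrative, not the actual instrumentation):

def on_local_work(state, elapsed, in_zeroed_procedure):
    cp, share = state
    cp += elapsed
    if in_zeroed_procedure:
        share += elapsed      # this time would disappear if the procedure took 0
    return (cp, share)

def on_receive(local_state, piggybacked_state):
    # keep whichever path is longer once the zeroed procedure's time is removed
    def cp_net(state):
        cp, share = state
        return cp - share     # CPnet = CP - Share
    return max(local_state, piggybacked_state, key=cp_net)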
Program Activity Graph
[Figure: example program activity graph for two processes, running from an initial node to a final node; vertices are procedure calls (call(a), call(b), call(c)) and message events (startSend, endSend, startRecv, endRecv), and edges are weighted with elapsed time.]
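Since the critical path is the longest path through this graph, here is a small sketch of computing it over a DAG of timed edges in topological order (the example graph and weights are made up, not the graph pictured above):

from collections import defaultdict

def critical_path(edges, source, sink):
    """edges: list of (u, v, weight). Returns (length, path) of the longest
    source-to-sink path, i.e. the critical path."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    # longest path in a DAG via topological (Kahn) order
    dist = {n: float("-inf") for n in nodes}
    prev = {}
    dist[source] = 0.0
    ready = [n for n in nodes if indeg[n] == 0]
    while ready:
        u = ready.pop()
        for v, w in succ[u]:
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                prev[v] = u
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    path, n = [sink], sink
    while n != source:
        n = prev[n]
        path.append(n)
    return dist[sink], list(reversed(path))

# tiny example with made-up weights
length, path = critical_path(
    [("start", "call_a", 4), ("start", "call_b", 1),
     ("call_a", "end", 2), ("call_b", "end", 8)],
    "start", "end")
print(length, path)   # -> 9.0, path start -> call_b -> end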
NAS IS Application
Procedure     CP     % CP    CPU    % CPU
nas_is_ben    56.4   74.1    42.0   54.8
create_seq     9.2   12.4     0.4    1.6
do_rank        9.2   12.4     9.2   12.5
– create_seq is more important than CPU time indicates.
– do_rank is ranked higher than create_seq by CPU time.
Load Balancing Factor
• Key Idea: what if we move work?
  – the length of each activity remains the same
  – where the computation is performed changes
• Two Granularities Possible
  – process level
    • process placement or migration
  – procedure level
    • function shipping
    • fine-grained thread migration
Process LBF
• What if we change the processor assignment?
  – predict execution time on larger configurations
  – try out different allocations
• Issues:
  – changes in communication cost
    • local vs. non-local communication
  – interaction with the scheduling policy
    • how are nodes shared?
    • assume round-robin (toy sketch below)
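A toy illustration of the round-robin assumption (the helper below is invented for illustration only):

def round_robin_assign(num_processes, num_processors):
    # deal processes out to processors in turn
    return {p: p % num_processors for p in range(num_processes)}

mapping = round_robin_assign(num_processes=3, num_processors=2)
# {0: 0, 1: 1, 2: 0}: processes 0 and 2 share a processor, so communication
# between them becomes local in the what-if configuration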
Computing Load Balancing Factor
[Figure: a program activity graph for three processes (P1, P2, P3), with calls and send/receive events, is collapsed into a group activity graph by assigning the processes to groups G1 and G2.]
Using Paradyn to Implement Process LBF
• Forward data from the application to the monitor
  – need to forward events to a central point
  – supports samples
  – requires extensions to the data collection system
• Provides dynamic control of data collection
  – only piggyback instrumentation on demand
• Need to correlate data from different nodes
  – use the $globalId MDL variable
Results : Accuracy
[Chart: time (sec) for the NAS benchmarks EP, FT, IS, and MG, comparing measured time on 16 processors with the time for 16 processors predicted on 16 processors and on 8 processors.]
LBF Overhead (16 nodes)
[Chart: time (sec) for EP, FT, IS, and MG on 16 nodes, measured with and without instrumentation.]
Changing Network and Processes
Change: # of nodes (8 -> 16); network (10 Mbps Ethernet -> 320 Mbps HPS)
[Chart: time (sec) for EP, FT, IS, and MG, comparing measured time on 16 processors with HPS against the time predicted from a run on 8 processors with Ethernet.]
Linger Longer
• Many Idle Cycles on Workstations
  – even when users are active, most processing power is not used
• Idea: Fine-grained cycle stealing
  – run processes at a very low priority
  – migration becomes an optimization, not a necessity
• Issues:
  – how long to linger?
  – how much disruption to foreground users?
    • delay of local jobs: process switching
    • virtual memory interactions
Simulation of Policies
• Model the workstation as (sketch below)
  – a foreground process (high priority)
    • requests the CPU, then blocks
    • a hybrid of trace-based data and a model
  – a background process (low priority)
    • always ready to run, with a fixed CPU time
  – context switches (each takes 100 microseconds)
    • account for both direct state and cache reload
• Study:
  – What is the benefit of lingering?
  – How much will lingering slow foreground processes?
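A toy version of this model (all parameters and structure are illustrative assumptions, not the study's actual simulator): a high-priority foreground process alternates CPU bursts and blocking, a low-priority background job soaks up the idle time, and every switch costs 100 microseconds.

import random

CTX_SWITCH = 100e-6  # seconds per switch (direct state plus cache reload)

def simulate(bg_cpu_need, duration, mean_burst=0.02, mean_idle=0.08):
    t = bg_done = fg_delay = 0.0
    while t < duration and bg_done < bg_cpu_need:
        # foreground requests the CPU, runs, then blocks
        fg_delay += CTX_SWITCH            # cost of preempting the background job
        t += CTX_SWITCH + random.expovariate(1.0 / mean_burst)
        # background job runs at low priority during the idle period
        idle = random.expovariate(1.0 / mean_idle)
        bg_done += min(idle, bg_cpu_need - bg_done)
        t += CTX_SWITCH + idle
    return bg_done, fg_delay

done, delay = simulate(bg_cpu_need=5.0, duration=60.0)
print(f"background CPU obtained: {done:.2f} s, foreground delay: {delay*1e3:.2f} ms")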
Migration Policies
• Immediate Eviction (IE)
  – when a user returns, migrate the job immediately
  – policy used by Berkeley NOW
  – assumes a free workstation, or no penalty to stop the job
• Pause and Migrate (PM)
  – when a user returns, pause the job and then migrate it
  – used by Wisconsin Condor
• Linger Longer (LL)
  – when a user returns, decrease the job's priority and remain
    • monitor the situation to decide when to migrate
  – permits fine-grained cycle stealing
• Linger Forever (LF)
  – like Linger Longer, but never migrate
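To make the distinctions concrete, a hedged sketch of the four policies as reactions to a user returning (the Policy enum and the job methods are invented for illustration):

from enum import Enum

class Policy(Enum):
    IE = "immediate eviction"
    PM = "pause and migrate"
    LL = "linger longer"
    LF = "linger forever"

def on_user_return(policy, job):
    if policy is Policy.IE:
        job.migrate()                 # leave the workstation right away
    elif policy is Policy.PM:
        job.pause()                   # stop competing for the CPU first,
        job.migrate()                 # then move the job elsewhere
    elif policy is Policy.LL:
        job.set_priority("lowest")    # keep stealing leftover cycles;
        # monitoring later decides whether to migrate
    elif policy is Policy.LF:
        job.set_priority("lowest")    # never migrate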
Simulation Results - Sequential Workload
[Charts: average time (sec), broken into AVG RUN TIME, AVG IN Q TIME, AVG MIGR TIME, and AVG LINGER TM, for the LL, LF, IE, and PM migration policies.]
– LF is the fastest, but its variation is higher than LL's.
– LL and LF have lower variation than IE or PM.
– Slowdown for foreground jobs is under 1%.
– LF is a 60% improvement over the PM policy.
Simulation Results - Parallel Applications
– Uses DSM applications on non-idle workstations
– Assumes a 1.0 Gbps LAN
– Compares lingering vs. reconfiguration
[Charts: fft and water on 16 nodes with 20% non-idle local usage; time (billion cycles) for linger vs. reconfiguration as the number of idle nodes varies from 16 down to 0.]
– Lingering is often faster than reconfiguration!
Future Directions
• Wide Area Test Configuration
  – simulate a high-latency, high-bandwidth network
  – a controlled testbed for wide-area computing
• Parallel Computing on non-dedicated clusters
  – current simulations show promise, but ...
    • need to include data about the memory hierarchy
    • the real test is to build the system
• Development of the Metric and Option Interface
  – prototype applications that can adapt to change
  – evaluate different adaptation policies