On Grid Performance Evaluation using Synthetic Workloads

DGSim: Comparing Grid Resource
Management Architectures
Through Trace-Based Simulation
Alexandru Iosup, Ozan Sonmez, and Dick Epema
PDS Group
Delft University of Technology
The Netherlands
Euro-Par 2008, Las Palmas, 27 August 2008
A Grid Research Toolbox
• Hypothesis: (a) is better than (b).
For scenario 1, …
[Figure: grid research toolbox diagram; visible labels: 1, 2, 3, DGSim]
The Problem with Grid Simulations
• Three decades of writing simulators in computer science
→ writing the simulator is not the problem
• The problem: getting from solution design to
experimental results with an automated simulation tool
• Experimental setup
• Tool to generate realistic experimental setups
• Experiment support for grid resource management
• Tool to manage large numbers of related simulations
• Performance
• Not the simulation time (decades of optimizations there)
• Tool proved to work with large simulations (number of
resources, workload size, etc.)
Outline
1. Problem Statement
2. The DGSim Framework
3. DGSim Validation
4. DGSim Examples
5. Future Work
2. The DGSim Framework
Name, Goal, and Challenges
• DGSim = Delft Grid Simulator
• Simulate various grid resource management architectures
• Multi-cluster grids
• Grids of grids (THE grid)
[Figure: two grid resource management (GRM) architectures]
• Challenges
• Many types of architectures
• Generating and replaying grid workloads
• Management of the simulations
• Many repetitions of a simulation for statistical relevance
• Simulations with many parameters
• Managing results (e.g., analysis tools)
• Enabling collaborative experiments
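The simulation-management challenges above (many repetitions, many parameters) can be sketched as a parameter sweep. This is only an illustration under stated assumptions, not DGSim's interface: `run_simulation`, the parameter names, and the dummy metric are hypothetical.

```python
import itertools
import random
import statistics

def run_simulation(architecture, load, seed):
    """Hypothetical stand-in for one simulation run; returns a dummy metric."""
    rng = random.Random(seed)
    # Placeholder result: a real run would replay a workload on a simulated grid.
    return load * 100.0 + rng.uniform(-1.0, 1.0)

def sweep(param_grid, repetitions):
    """Run every parameter combination `repetitions` times, one seed per repetition."""
    names = sorted(param_grid)
    results = {}
    for values in itertools.product(*(param_grid[n] for n in names)):
        config = dict(zip(names, values))
        samples = [run_simulation(seed=rep, **config) for rep in range(repetitions)]
        # Keep mean and standard deviation so results can be compared statistically.
        results[values] = (statistics.mean(samples), statistics.stdev(samples))
    return results

grid = {
    "architecture": ["centralized", "hierarchical", "decentralized"],
    "load": [0.5, 0.7, 0.9],
}
results = sweep(grid, repetitions=10)
```

Each key of `results` is one design point; repeating runs with different seeds is what makes the per-point statistics meaningful.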
2. The DGSim Framework
Overview
[Figure: DGSim framework overview; visible label: Discrete-Event Simulator]
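For readers unfamiliar with the term, the core of a discrete-event simulator can be sketched in a few lines. This is a generic illustration, not DGSim's code: the `EventQueue` class, the event names, and the 5-unit runtime are assumptions.

```python
import heapq
import itertools

class EventQueue:
    """Minimal discrete-event core: a time-ordered queue of callbacks."""

    def __init__(self):
        self.now = 0.0
        self._heap = []
        self._ids = itertools.count()  # tie-breaker keeps same-time events in FIFO order

    def schedule(self, delay, action):
        """Schedule `action` (a zero-argument callable) `delay` time units from now."""
        heapq.heappush(self._heap, (self.now + delay, next(self._ids), action))

    def run(self):
        """Process events in timestamp order until none remain."""
        while self._heap:
            self.now, _, action = heapq.heappop(self._heap)
            action()

log = []
sim = EventQueue()

def job_arrival():
    log.append((sim.now, "arrival"))
    sim.schedule(5.0, job_finish)  # assumed runtime of 5 time units

def job_finish():
    log.append((sim.now, "finish"))

sim.schedule(2.0, job_arrival)
sim.run()
```

Simulated time jumps directly from one event to the next, so the cost of a run depends on the number of events rather than on the simulated duration.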
2. The DGSim Framework
Model Details: Inter-Operation Architectures
[Figure: inter-operation architectures]
• Independent
• Hierarchical
• Centralized
• Decentralized
• Hybrid hierarchical/decentralized
2. The DGSim Framework
Model Details: Resource Dynamics & Evolution
• Resource dynamics
• Short-term changes in resource availability status
A. Iosup, M. Jan, O. Sonmez, and D.H.J. Epema, On the Dynamic
Resource Availability in Grids, IEEE/ACM Grid, 2007.
• Resource evolution
• Long-term changes in number & … of resources
2. The DGSim Framework
Workloads: Generation and Model(s)
• Workload Generation
• Generate synthetic workload with realistic characteristics
• Iterative workload generation: incur specified load on a grid
• Parallel jobs
• Adapting the Lublin-Feitelson model to grids
A. Iosup, D.H.J. Epema, T. Tannenbaum, M. Farrellee, M. Livny,
Inter-Operating Grids through Delegated MatchMaking, ACM/IEEE
SuperComputing, 2007.
• Bags-of-Tasks: groups of independent single-processor tasks
• Validated with seven long-term grid traces
A. Iosup, O.O. Sonmez, S. Anoep, D.H.J. Epema, The Performance
of Bags-of-Tasks in Large-Scale Distributed Computing
Systems, IEEE HPDC, 2008.
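The idea of iterative workload generation (keep adding jobs until the workload incurs a specified load) can be sketched as follows. The distributions are placeholders chosen for illustration; they are not the Lublin-Feitelson model or the bag-of-tasks model cited above, and `generate_workload` is a hypothetical name.

```python
import random

def generate_workload(target_load, total_procs, duration, seed=0):
    """Add synthetic jobs until the offered load reaches `target_load`.

    Offered load = sum(procs * runtime) / (total_procs * duration).
    """
    rng = random.Random(seed)
    capacity = total_procs * duration  # available CPU-seconds
    jobs, demand = [], 0.0
    while demand / capacity < target_load:
        procs = rng.choice([1, 2, 4, 8, 16])    # assumed parallel job sizes
        runtime = rng.expovariate(1.0 / 600.0)  # assumed mean runtime: 600 s
        arrival = rng.uniform(0.0, duration)    # assumed uniform arrivals
        jobs.append({"arrival": arrival, "procs": procs, "runtime": runtime})
        demand += procs * runtime
    jobs.sort(key=lambda job: job["arrival"])
    return jobs, demand / capacity

# One simulated day on an assumed 100-processor grid at 70% offered load.
jobs, load = generate_workload(target_load=0.7, total_procs=100, duration=86_400)
```

Fixing the seed makes the generated workload reproducible, which matters when the same workload is replayed against several architectures.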
3. DGSim Validation
Functional Validation
• Functional validation (simple scenario)
• Workload: 100 jobs of constant size (10,000 work units), all arriving at t=0
• System: grid scheduler over one 10-resource cluster;
each resource processes 1 work unit/second; information delay = 0-3600 s
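A scenario this simple has a result that can be checked by hand. The sketch below computes the makespan under assumptions the slide does not state: zero information delay, each job occupying exactly one resource, first-come-first-served dispatch, and no scheduler overhead. The `makespan` helper is hypothetical.

```python
import heapq

def makespan(num_jobs, job_size, num_resources, speed=1.0):
    """Makespan when identical jobs, all present at t=0, run one per resource (FCFS)."""
    free_at = [0.0] * num_resources  # time at which each resource becomes idle
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_jobs):
        start = heapq.heappop(free_at)  # earliest available resource
        end = start + job_size / speed
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish

expected = makespan(num_jobs=100, job_size=10_000, num_resources=10)
print(expected)  # 100000.0
```

Under these assumptions the jobs run in 10 waves of 10,000 s each, so a simulator configured the same way should report a makespan of 100,000 s.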
3. DGSim Validation
Real vs. Simulated DAS-3 Multi-Cluster Grid
• Simulator setup
• Application: synthetic parallel, communication-intensive (all-gather)
• Measured: runtime for various configurations (co-allocation)
• System: heterogeneous clusters, Koala co-allocating scheduler
• Workload: 300 jobs, submitted over a period of 6 hours
• All jobs submitted through central cluster gateways
• Results
• Scheduling algorithm leads to similar results in real and simulated
environments → can use simulator for analyzing scheduling trends
• Simulator under-estimates waiting time (failures in the real environment lead to more contention)
4. DGSim Examples
Sample 1/3
• Investigate mechanisms for inter-operating grids
• New mechanism: Delegated MatchMaking (DMM)
• Trace-based performance evaluation through simulations
• Real and model-based traces
• Largest trace: 1.4M jobs
• Simulate Grid’5000+DAS-2
• Explored a design space of over 1 million design points
A. Iosup, D.H.J. Epema, T. Tannenbaum, M. Farrellee, M. Livny,
Inter-Operating Grids through Delegated MatchMaking, ACM/IEEE
SuperComputing, 2007.
4. DGSim Examples
Sample 2/3
• What is the performance impact of the dynamic grid resource availability?
• Four models for grid resource availability information
• Trace-based performance evaluation through simulations
• Real traces
• Simulate Grid’5000
• KA = AMA > HMA >> SA
• Goodput decreases with intervention delay
[Figure: average normalized goodput [cpuseconds/day/proc] per availability information model: SA, KA, AMA (60s, 1h), HMA (1w, 1mo, never); other labels: availability information delay (on-time (0), short period, long period), resource availability (static, dynamic)]
A. Iosup, M. Jan, O. Sonmez, and D.H.J. Epema, On the Dynamic
Resource Availability in Grids, IEEE/ACM Grid, 2007.
4. DGSim Examples
Sample 3/3
• Analyze performance of bag-of-tasks scheduling algorithms
• Information availability framework: Known, Unknown, Historical records
• Trace-based performance evaluation through simulations
• Real and model-based traces
• Simulate Grid’5000+DAS
• Evaluated 8 scheduling algorithms
• Explored a design space of over 2 million design points
[Table: scheduling algorithms classified by task information and resource information (K = Known, H = Historical records, U = Unknown); cell entries in source order: ECT, FPLT; ECT-P; FPF; DFPLT, MQD; STFR; RR, WQR]
A. Iosup, O.O. Sonmez, S. Anoep, D.H.J. Epema, The Performance
of Bags-of-Tasks in Large-Scale Distributed Computing Systems,
IEEE HPDC, 2008.
Conclusion and Future Work
• The DGSim framework
• Tool to generate realistic experimental setups
• Tool to manage large numbers of grouped simulations
• Tool proved to work with large simulations
• Validated underlying models and assumptions
• Resource dynamics and evolution model
• Workload model
• Comparing grid resource management architectures
• Proven in various settings
• Future work
• More scenarios
• Library of ready-to-use scenarios
Thank you! Questions? Remarks? Observations?
• Contact: [email protected] [google “Iosup”]
• Web sites:
o http://www.vl-e.nl : VL-e project
o http://www.pds.ewi.tudelft.nl : PDS group articles & software