When the Grid Comes to Town
Chris Smith, Senior Product Architect
Platform Computing
[email protected]
LSF 6.0 Feature Overview
Comprehensive Set of Intelligent Scheduling Policies
Goal-oriented SLA Scheduling
Queue-Based Fairshare Enhancements
Job Groups
Advanced Self-Management
Job-level Exception Management
Job Limit Enhancements
Non-normalized Job Run Limit
Resource Allocation Limit Display
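A couple of these features surface directly at the command line and in cluster configuration. A minimal sketch, assuming the LSF 6.0 syntax for job groups (bsub -g, bjgroup) and the non-normalized run limit switch (ABS_RUNLIMIT in lsb.params); the group path and script name are illustrative:

    # lsb.params: treat run limits as wall-clock values, not normalized
    # by the execution host's CPU factor (non-normalized job run limit)
    Begin Parameters
    ABS_RUNLIMIT = Y
    End Parameters

    # Submit into a hierarchical job group with a 60-minute run limit
    bsub -g /risk/overnight -W 60 ./simulate.sh

    # Display job groups and their job counts
    bjgroup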
LSF 6.1 was focused on performance and scalability
Scalability Targets
5K hosts per cluster
500K active jobs at any one time
100 concurrent users executing LSF commands
1M completed jobs per day
Performance Targets
90% min slot utilization
5 seconds max command response time
20 seconds real pending reason time
4 KB max memory usage per job (mbatchd + mbschd)
5 minutes max master failover time
2 minutes max reconfig time
Performance, Reliability and Scalability
Industry-leading performance, reliability, and scalability
Supporting the largest and most demanding enterprise clusters
Extending leadership over the competition
Feature -> Benefit
Faster response times for user submission and query commands -> Improved user experience
Faster scheduling and dispatch times -> Increased throughput and cluster utilization
Faster master fail-over -> Improved availability, minimized downtime
Dynamic host membership improvements (host groups now supported) -> Reduced administration effort, higher degree of self-management
Pending job management, limiting the number of pending jobs (see the sketch below) -> Prevents accidental overloading of the cluster with error jobs
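A sketch of the pending-job cap mentioned above, assuming the MAX_PEND_JOBS parameter as documented for LSF; the value is illustrative:

    # lsb.params: cluster-wide ceiling on pending jobs; further
    # submissions are rejected once the cap is reached
    Begin Parameters
    MAX_PEND_JOBS = 10000
    End Parameters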
Results: Platform LSF V6.0 vs V6.1
(Each configuration is "hosts, active jobs". Times in seconds; job memory is per-job mbatchd+mbschd usage in KB. "-" = no target / not measured.)

                    slot   job throughput  job mem  bjobs     bqueues   daemon not    failover time   reconfig
                    util   (jobs/hr)       (KB)     resp (s)  resp (s)  respond msgs  (mbdrestart, s) time (s)
Target              >90%   -               <4       <5        <5        No            <300            <120
LSF 6.0 (3K, 50K)   74%    45,117.70       8.39     67.83     64.82     Yes           637             181
LSF 6.1 (3K, 50K)   94%    68,960.60       1.38     0.86      0.48      No            200             82
LSF 6.1 (3K, 100K)  94%    66,635.40       1.52     0.90      0.49      No            -               -
LSF 6.1 (3K, 500K)  93%    70,773.90       1.08     1.00      0.91      No            318             58
LSF 6.1 (5K, 100K)  79%    90,017.40       1.29     1.72      1.16      No            -               -
LSF 6.1 (5K, 500K)  73%    77,947.90       1.11     1.68      1.20      No            -               -

Note: when we tested Platform LSF V6.0 with a 100K job load, we observed that mbatchd grew to 1.3 GB and used 99.8% CPU.
Grid Computing Issues
Grid level scheduling changes some things
With the wider adoption of computing Grids as access mechanisms to local
cluster resources, some of the requirements for the cluster resource manager
have changed.
Users are coming from different organizations. Have they been authenticated?
Do they have a user account?
I have to stage in data from where?!
Local policies must reflect some kind of balance between meeting local user
requirements, and promoting some level of sharing.
How can the sites involved in a Grid get an idea of what kind of workload is
being run, and how it impacts their resources?
How can users access resources without needing a 30” display to show load
graphs and queue lengths for the 10 different clusters they have access to?
Thinking about these issues can keep one awake at night.
Grid Identities are not UNIX user identities
Traditionally, LSF’s notion of users is very much tied to the UNIX user
identity
Local admins must define local users for all users of the system
Can use some (brittle) form of user name mapping
Grid middleware (Globus-based) uses the Grid Security Infrastructure (GSI), built on PKI
Grid map file maps users to local uids
Same management nightmare
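For reference, a Globus grid map file is just a flat list pairing certificate subject DNs with local accounts, one entry per user (the DN and account below are illustrative):

    # /etc/grid-security/grid-mapfile
    "/C=CA/O=Grid/OU=example.org/CN=Jane Doe" jdoe

Every new grid user means another hand-maintained line, which is exactly the management problem described above.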
Grid users are usually “second class citizens”
It would be nice to have an identity model where both the grid and the local
scheduler share a notion of a consumer, and perhaps allow more flexible use
of local user accounts (e.g. Legion)
Where are applications located, and how are they configured?
Users get used to their local configurations
local installations of applications
environment variable names
there is a learning curve per site
Need some kind of standardization
could do TeraGrid-style software stack standardization, but this is very
inflexible
need a standardized job description database
application location
local instantiation of environment variables
tie in with DRMAA job category
Platform Professional Services people have used the "jsub" job starter
Are provisioning services the answer?
would be nice to dynamically install an application image and environment on
demand with a group of jobs
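A sketch of what one entry in such a job description database might look like. This stanza is purely hypothetical (it borrows the Begin/End stanza style of LSF configuration files), with made-up names and paths:

    # Hypothetical per-site application registry entry
    Begin Application
    NAME     = blast                           # portable application name
    PATH     = /opt/apps/blast/bin/blastall    # local install location
    ENV      = BLASTDB=/data/blastdb           # site-local environment setup
    CATEGORY = bio_sequence                    # could map to a DRMAA job category
    End Application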
How do administrators set scheduler policy?
It’s probably easiest to make those pesky grid users second class
citizens (back to the identity issue)
A federated identity system (based on user’s role within a VO) could
make sure that they get into the “right queue”
There are too many tuneables within local schedulers. It would be nice to
have some kind of "self-configuration" based on higher-level policies
Platform’s goal based scheduling (project based scheduling)
Current “goals” include deadline, throughput, and velocity
How are resources being used, and who is doing what?
Need some kind of insight into the workload, users and projects
Needs to be “VO aware”
Something like Platform’s analytics packages
Data set management/movement for batch jobs
Should a job go to its data, or should data flow to a job?
current schedulers don’t take this into consideration
ideally we would like to flow jobs that use the same data to a site (set of
hosts) which has already "cached" the data
but where's the sweet spot before a cached site becomes a hot spot?
The scheduler's job submission mechanisms (both local and Grid) need
to be able to specify data set usage, and the scheduler should use this
as a factor in scheduling
Moreover, there needs to be some kind of feedback loop between the
flowing of data between sites and the flowing of jobs between sites
If I had a predictive scheduler, I could have data transfers happen “just
in time”
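As a concrete sketch, mirroring the bsub -extsched example later in this deck (the job script name is illustrative): the submission names the data set, and a data-cache-aware scheduler plug-in uses it as a placement factor:

    # Tell the scheduler which data set the job will read; sites that
    # already cache "MOL" are preferred when choosing where to forward
    bsub -extsched "MOL" ./dock_ligands.sh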
Platform’s Activities
So how do we find the solution to these issues?
We (Platform) need some experience working within Grid environments.
CSF (Community Scheduler Framework - not RAL’s scheduler)
provides a framework we can use to experiment with metascheduling
concepts and issues
But it doesn't have the wide array of features or the scalability we have in LSF
Why not use LSF itself as a metascheduler?
We are engaged in Professional Services contracts doing this right now
Sandia National Lab - Job Scheduler interface to many PBS resources
using LSF as the bridge. Integrates Kerberos and external file transfer.
National Grid Office of Singapore - LSF (and its WebGUI) will be the
interface to computing resources at multiple sites. There are PBS, SGE
and LL (LoadLeveler) clusters (some with Maui). Automatic matching of jobs to
clusters is desired.
CSF Architecture
[Architecture diagram: Platform LSF users and Globus Toolkit users submit to a Meta-Scheduler running in a Grid Service Hosting Environment, alongside a Job Service, a Reservation Service, and a Queuing Service. A Global Information Service aggregates data from per-resource RIPS (Resource Information Provider Service) instances. Work is dispatched via GRAM to SGE and PBS clusters, and via an RM Adapter to Platform LSF; a Metascheduler Plugin hooks the framework into LSF.]
LSF as a Metascheduler
[Diagram, the "60,000 ft" view: users reach a Job Scheduler through a Web Portal; an LSF Scheduler uses MultiCluster to forward work to LSF clusters/desktops, and to a second LSF Scheduler that fronts PBS, SGE, and LL (LoadLeveler) clusters.]
Data Centric Scheduling
The solution comes in two parts:
Data Centric Scheduling
Dispatch compute jobs to the machines for which the cost of accessing
the data is "cheapest"
cache aware scheduler
topology aware scheduler e.g. uses distance vectors to measure
how far a host is from a data set
Workload Driven Data Management
Just as the workload scheduler is cognizant of data locality, a data
manager needs to be cognizant of future workload that will exercise
given data sets
If data sets can be transferred before they are needed, the latency of
synchronous data transfer is mitigated
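One illustrative way to make the "cost of accessing data" concrete (a sketch, not Platform's published algorithm): for a job J needing data sets D, score each candidate host h as

    cost(h, J) = sum over d in D of size(d) * dist(h, site(d)) * (1 - cached(d, h))

where dist(h, s) comes from the topology's distance vectors and cached(d, h) is 1 if d is already staged at h's site. Dispatching to the lowest-cost host makes jobs that share a data set flow naturally to the site that already holds it.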
Data cache aware scheduling
[Diagram, reconstructed as a sequence. Data sets cached per site: Site 1 holds MOL and MOL2; Site 2 holds none; Site 3 holds MOL.]
1. The Data Management Service polls the sites for data sets.
2. The sites update the service's cache info.
3. A user at the local site (Site 1) submits: bsub -extsched MOL
4. The local site is overloaded, so the data-cache-aware scheduler plug-in decides to forward the job to Site 3, since it has the MOL database.
5. The job is forwarded to Site 3.
Goal-Oriented SLA-Driven Scheduling
What is it?
Goal-oriented "just-in-time" scheduling policy
Unlike current scheduling policies based on configured shares or limits,
SLA-driven scheduling is based on customer-provided goals:
Deadline-based goal: specify the deadline for a group of jobs
Velocity-based goal: specify the number of jobs running at any one time
Throughput-based goal: specify the number of finished jobs per hour
Allows users to focus on the "what and when" of a project instead of the
"how" (see the sketch below)
Goal-Oriented SLA-Driven Scheduling
Benefits
Guarantees projects are completed on time according to explicit SLA
definitions
Provides visibility into the progress of projects, showing how well they
are tracking to their SLAs (see the bsla sketch below)
Allows the admin to focus on what work needs to be done and when,
not how the resources are to be allocated
Guarantees service-level delivery to the user community, reducing
project risk and administration cost
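For the visibility point above: LSF ships a command that reports each service class's goals, whether it is on time or delayed, and its job counts (the service class name matches the earlier sketch):

    # Show SLA progress for the "Kyuquot" service class
    bsla Kyuquot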
Summary
Local scheduler technology continues to progress well... within the
cluster.
Grid level schedulers raise issues which haven’t been dealt with before
cluster users are no longer “local”
local scheduling policies aren’t really applicable
data management and environment management are more difficult
Platform is working to solve some of these issues
implementing meta-schedulers
researching new scheduling policies
Need to work closely with the HEP community since they are causing
the biggest problems!
Questions?