
PROOF: the Parallel ROOT Facility
Scheduling and Load-balancing
ACAT 2007
Jan Iwaszkiewicz ¹ ²
Gerardo Ganis ¹
Fons Rademakers ¹
¹ CERN PH/SFT
² University of Warsaw
Outline
• Introduction to the Parallel ROOT Facility
• Packetizer – load balancing
• Resource Scheduling
Analysis of Large Hadron Collider data
• Necessity of distributed analysis
• ROOT – popular particle-physics data analysis framework
• PROOF (ROOT's extension) – automatically parallelizes processing across computing clusters or multicore machines
Who is using PROOF
• PHOBOS
– MIT, dedicated cluster, interfaced with Condor
– Real data analysis, in production
– Very positive experience: functionality, large speedup, efficient
– But not really the LHC scenario: usage limited to a few experienced users
• ALICE
– CERN Analysis Facility (CAF)
• CMS
– Santander group, dedicated cluster
– Physics TDR analysis
Using PROOF: example
• PROOF is designed for the analysis of independent objects, e.g. ROOT trees (the basic data format in particle physics)
• Example of processing a set of ROOT trees, locally and with PROOF:
Local ROOT:

// Create a chain of trees
root[0] TChain *c = CreateMyChain();
// MySelec is a TSelector
root[1] c->Process("MySelec.C+");

PROOF:

// Create a chain of trees
root[0] TChain *c = CreateMyChain();
// Start PROOF and tell the chain to use it
root[1] TProof::Open("masterURL");
root[2] c->SetProof();
// Process goes via PROOF
root[3] c->Process("MySelec.C+");
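
For orientation, a minimal sketch of what a selector like MySelec could look like. The class name, the histogram, and the method bodies are illustrative; the virtual methods are the standard TSelector interface:

// MySelec.h – minimal TSelector skeleton (illustrative bodies)
#include "TSelector.h"
#include "TH1F.h"

class MySelec : public TSelector {
public:
   TH1F *fHist;                            // example output histogram

   MySelec() : fHist(0) {}
   void SlaveBegin(TTree *) {
      fHist = new TH1F("h", "example", 100, 0., 10.);
      fOutput->Add(fHist);                 // PROOF merges fOutput objects
   }
   Bool_t Process(Long64_t entry) {
      // read 'entry' from the current tree and fill fHist here
      return kTRUE;
   }
   void Terminate() {
      // runs on the client after merging; draw or save fHist here
   }
   ClassDef(MySelec, 1);
};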
Classic batch processing
[Diagram: a query is split externally into jobs (myAna.C on split data files), submitted by a manager to batch-farm queues; workers read the files from storage via a catalog, and the job outputs are merged afterwards for the final analysis]
– static use of resources
– jobs frozen: 1 job / worker node
– external splitting, merging
PROOF processing
[Diagram: the client sends a PROOF job (data file list + myAna.C) to the MASTER; a scheduler distributes the work over the PROOF farm, which reads the files from storage via a catalog; feedback and the merged final outputs return to the client]
– farm perceived as extension of local PC
– same syntax as in local session
– more dynamic use of resources
– real-time feedback
– automated splitting and merging
Challenges for PROOF
• Remain efficient under heavy load
• 100% exploitation of resources
• Reliability
Levels of scheduling
• The packetizer
– Load balancing on the level of a job
• Resource scheduling
(assigning resources to different jobs)
– Introducing a central scheduler
– Priority based scheduling on worker nodes
Packetizer’s role
• Lookup – check the locations of all files and initiate staging, if needed
• Workers contact the packetizer and ask for new packets (pull architecture)
• A packet has info on
– which file to open
– which part of the file to process
• The packetizer keeps assigning packets until the dataset is processed (see the sketch below)
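
A minimal sketch of this pull protocol; the Packet struct and the Packetizer class below are illustrative stand-ins, not the real PROOF classes:

// Illustrative sketch of the pull architecture (not the PROOF code).
#include <queue>
#include <string>

struct Packet {                  // unit of work handed to a worker
   std::string fFileName;        // which file to open
   long long   fFirstEntry;      // where to start in that file
   long long   fNumEntries;      // how many entries to process
};

class Packetizer {
   std::queue<Packet> fPackets;  // remaining work for the dataset
public:
   void AddPacket(const Packet &p) { fPackets.push(p); }
   // Pull architecture: an idle worker calls this to ask for work,
   // so faster workers naturally ask more often.
   bool GetNextPacket(Packet &out) {
      if (fPackets.empty()) return false;  // dataset fully assigned
      out = fPackets.front();
      fPackets.pop();
      return true;
   }
};

// Worker side: keep pulling until the dataset is processed.
void WorkerLoop(Packetizer &pktz) {
   Packet p;
   while (pktz.GetNextPacket(p)) {
      // open p.fFileName and process entries
      // [p.fFirstEntry, p.fFirstEntry + p.fNumEntries)
   }
}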
PROOF dynamic load balancing
• Pull architecture guarantees scalability
[Diagram: over time, the master hands out packets – the unit of work distribution – to Worker 1 … Worker N as each finishes its previous packet]
• Adapts to variations in performance
TPacketizer: the original packetizer
• Strategy (sketched below)
– Each worker processes its local files first, then the remaining remote files
– Fixed-size packets
– Avoid overloading a data server: at most 4 of its remote files may be served at once
• Problems with the TPacketizer
– Long tails with some I/O-bound jobs
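
The remote-file cap could look like the following; this is an illustrative model of the rule, not the actual TPacketizer implementation:

// Illustrative model of the "max 4 remote readers per server" rule.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct RemoteFile { std::string fServer, fName; };

class RemoteFileLimiter {
   std::map<std::string, int> fReaders;  // active remote readers per server
   std::vector<RemoteFile>    fRemote;   // remote files still to process
public:
   void Add(const RemoteFile &f) { fRemote.push_back(f); }
   // Hand out a remote file only if its server has spare capacity.
   bool TakeRemote(RemoteFile &out) {
      for (std::size_t i = 0; i < fRemote.size(); ++i) {
         if (fReaders[fRemote[i].fServer] < 4) {   // cap per data server
            ++fReaders[fRemote[i].fServer];
            out = fRemote[i];
            fRemote.erase(fRemote.begin() + i);
            return true;
         }
      }
      return false;  // every server at the cap: the worker retries later
   }
   void Done(const RemoteFile &f) { --fReaders[f.fServer]; }
};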
Performance tests with ALICE
• 35 PCs, dual Xeon 2.8 GHz, ~200 GB disk
– Standard CERN hardware for LHC
• Machine pools managed by xrootd
– Data of the Physics Data Challenge ’06 distributed (~1 M events)
• Tests performed
– Speedup (scalability) tests
– System response when running a combination of job types for an increasing # of concurrent users
Example of problems with some I/O-bound jobs
[Plots: processing rate during a query, and resource utilization]
How to improve
• Focus on I/O-bound jobs
– Limited by hard-drive or network bandwidth
• Predict which data servers can become bottlenecks
• Make sure that other workers help analyzing data from those servers
• Use time-based packet sizes (see the sketch below)
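
A minimal sketch of a time-based packet size, assuming the packetizer tracks each worker's measured processing rate; the names and the target duration are illustrative:

// Illustrative time-based packet sizing (not actual PROOF code).
// Aim for packets that take roughly the same wall-clock time on
// every worker, so slow workers get proportionally smaller packets.
long long PacketSize(double entriesPerSec,   // worker's measured rate
                     double targetSeconds)   // desired packet duration
{
   long long n = (long long)(entriesPerSec * targetSeconds);
   return n > 0 ? n : 1;                     // always at least one entry
}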
TAdaptivePacketizer
• Strategy (sketched below)
– Predict the processing time of the local files for each worker
– For workers expected to finish earlier, keep assigning remote files from the beginning of the job
– Assign remote files from the most heavily loaded file servers
– Variable packet size
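
A sketch of the prediction step, assuming the master tracks each worker's rate and remaining local entries; all names are illustrative, not the actual TAdaptivePacketizer internals:

// Illustrative: decide which workers should be fed remote packets.
#include <cstddef>
#include <vector>

struct WorkerState {
   long long fLocalEntriesLeft;   // unprocessed entries on local disk
   double    fEntriesPerSec;      // measured processing rate
};

// Expected time for a worker to drain its local files.
double LocalFinishTime(const WorkerState &w) {
   return w.fLocalEntriesLeft / w.fEntriesPerSec;
}

// A worker predicted to finish its local data before the slowest
// worker helps with remote files from the start of the job,
// instead of everybody waiting for the tail.
bool ShouldTakeRemote(const WorkerState &w,
                      const std::vector<WorkerState> &all) {
   double slowest = 0.;
   for (std::size_t i = 0; i < all.size(); ++i) {
      double t = LocalFinishTime(all[i]);
      if (t > slowest) slowest = t;
   }
   return LocalFinishTime(w) < slowest;
}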
Improvement by up to 30%
[Plots: processing rate over time with TPacketizer vs TAdaptivePacketizer]
Scaling comparison for a randomly distributed data set
[Plot: scaling comparison of the two packetizers]
Resource scheduling
• Motivation
• Central scheduler
– Model
– Interface
• Priority based scheduling on worker nodes
Why scheduling?
• Controlling resources and how they are
used
• Improving efficiency
– assigning to a job the nodes that hold the data it needs to analyze
• Implementing different scheduling policies
– e.g. fair share, group priorities & quotas
• Efficient use even in case of congestion
PROOF-specific requirements
• Interactive system
– Jobs should be processed as soon as they are submitted
– However, when the maximum system throughput is reached, some jobs have to be postponed
• I/O-bound jobs use more resources at the start and less at the end (file distribution)
• Try to process data locally
• The user defines a dataset, not the number of workers
• Possibility to remove/add workers during a job
Starting a query with a central scheduler (planned)
[Diagram: the client submits a job to the master; the master consults the external scheduler, which uses the cluster status, a dataset lookup, and the user's priority and history to decide which workers to start; the master's packetizer then drives the query]
Plans
• Interface for scheduling “per job”
– Special functionality will allow changing the set of nodes during a session without losing user libraries and other settings
• Removing workers during a job
• Integration with a scheduler
– Maui, LSF?
Priority based scheduling on nodes
• Priority-based worker-level load balancing
– Simple and solid implementation, no central unit
– Group priorities defined in the configuration file
• Performed on each worker node independently
• Lower-priority processes slow down
– by sleeping before the next packet request (see the sketch below)
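
A minimal sketch of that slowdown, assuming a group priority in (0, 1] and a measured per-packet processing time; the scaling is illustrative, not the actual PROOF formula:

// Illustrative worker-side throttling (not the actual PROOF code).
// Before asking for the next packet, a low-priority worker sleeps so
// that its share of wall-clock time is roughly its group priority.
#include <unistd.h>

void SleepBeforeNextPacket(double groupPriority,  // in (0, 1]
                           double packetSeconds)  // last packet's duration
{
   double sleepSec = packetSeconds * (1.0 / groupPriority - 1.0);
   if (sleepSec > 0.)
      usleep((useconds_t)(sleepSec * 1e6));       // POSIX microsecond sleep
}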
Summary
• The adaptive packetizer is working very well in the current environment; it will be further tuned once the scheduler is introduced
• Work on the PROOF interface to the scheduler is well advanced
• Priority-based scheduling on nodes is being tested
The PROOF Team
• Maarten Ballintijn
• Bertrand Bellenot
• Rene Brun
• Gerardo Ganis
• Jan Iwaszkiewicz
• Andreas Peters
• Fons Rademakers
http://root.cern.ch