
DIRAC – LHCb MC production system
A. Tsaregorodtsev,
CPPM, Marseille
For the LHCb Data Management team
CHEP, La Jolla
25 March 2003

Outline

 Introduction
 DIRAC architecture
 Implementation details
 Deploying DIRAC on the DataGRID
 Conclusions

What is it all about?

DIRAC – Distributed Infrastructure with Remote Agent Control

 A distributed MC production system for LHCb
 Automates most of the production tasks:
   • production task definition and steering;
   • software installation on the production sites;
   • job scheduling and monitoring;
   • data transfers and bookkeeping.
 Minimum participation of local production managers
 PULL rather than PUSH concept for job scheduling

DIRAC architecture

[Architecture diagram: central Production, Monitoring and Bookkeeping services; SW agents running at Sites A–D pull jobs from the Production service ("Get jobs") and send back monitoring information and bookkeeping data.]

Advantages of the PULL approach

 Better use of resources:
   • no idle or forgotten CPU power;
   • natural load balancing: a more powerful centre automatically gets more work.
 Less burden on the central production service:
   • it deals only with production task definitions and bookkeeping;
   • it does not need to know about particular production sites.
 No direct access to local disks from the central service
 Easy introduction of new sites into the production system:
   • no information on local sites is necessary at the central site.

Job description

[Workflow diagram: the production manager uses web-based editors to compose a workflow description from application steps (e.g. Pythia v2, Gauss v5, Brunel v12, GenTag v7) and a production run description (event type, application options, number of events, execution mode, destination site, ...); the result is stored in the Production DB as XML job descriptions (an illustrative sketch follows).]
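
The slides do not show the XML itself; the following Python sketch is only an illustration of what such a job description could carry, using made-up element names (eventType, numberOfEvents, step, ...) rather than the real DIRAC schema.

  # Illustrative only: element names and values are assumptions,
  # not the actual DIRAC job description schema.
  import xml.etree.ElementTree as ET

  job = ET.Element("job")
  ET.SubElement(job, "eventType").text = "inclusive-b"      # assumed value
  ET.SubElement(job, "numberOfEvents").text = "500"
  ET.SubElement(job, "executionMode").text = "batch"
  ET.SubElement(job, "destinationSite").text = "ANY"

  # One <step> per application in the workflow, e.g. Pythia -> Gauss -> Brunel.
  for name, version in [("Pythia", "v2"), ("Gauss", "v5"), ("Brunel", "v12")]:
      ET.SubElement(job, "step", application=name, version=version)

  print(ET.tostring(job, encoding="unicode"))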

Agent operations

[Sequence diagram: the Production agent talks to the local batch system, the Production, SW distribution, Monitoring and Bookkeeping services, and the Mass Storage. The calls, in order: isQueueAvailable(), requestJob(queue), installPackage(), submitJob(queue), setJobStatus(step 1) ... setJobStatus(step n), sendBookkeeping(), sendFileToCastor(), addReplica(). A minimal sketch of this cycle follows.]
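
As a rough illustration of the PULL cycle above, here is a minimal Python sketch; the service URLs, plugin objects and exact method signatures are assumptions (the real agent targeted Python 1.5.2 and the standard xmlrpclib client).

  # Minimal sketch of one agent cycle; endpoints, job fields and plugin
  # methods are assumptions, not the actual DIRAC interfaces.
  import xmlrpc.client

  PRODUCTION_URL = "http://example.org/production"   # placeholder URL
  MONITORING_URL = "http://example.org/monitoring"   # placeholder URL

  def run_one_cycle(queue, batch, software):
      """batch and software are the site-specific plugin objects
      (local batch system and software installation interfaces)."""
      production = xmlrpc.client.ServerProxy(PRODUCTION_URL)
      monitoring = xmlrpc.client.ServerProxy(MONITORING_URL)

      # PULL model: only ask the central service for work when the
      # local queue can accept another job.
      if not batch.is_queue_available(queue):
          return

      job = production.requestJob(queue)        # an XML job description
      if not job:
          return

      for package in job["packages"]:           # install required SW versions
          software.install_package(package)

      local_id = batch.submit(job["script"], queue)
      monitoring.setJobStatus(job["id"], "submitted", local_id)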

Implementation details

 Central web services:
   • XML-RPC servers (a minimal server sketch follows below);
   • web-based editing and visualization;
   • ORACLE production and bookkeeping databases.
 Agent: a set of collaborating Python classes
   • Python 1.5.2, to be sure it is compatible with all the sites;
   • XML-RPC client from the standard Python library;
   • runs as a daemon process or as a cron job on a production site;
   • easily extendable via plugins, e.g. for new applications or new tools such as file transport.
 Data and log files are transferred using bbftp.
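
For illustration, a central service exposing a couple of the calls from the "Agent operations" slide over XML-RPC could look like this sketch; the port, the in-memory job list and everything besides the method names are placeholders for the ORACLE-backed services.

  # Sketch of an XML-RPC production service; not the actual DIRAC code.
  from xmlrpc.server import SimpleXMLRPCServer

  class ProductionService:
      def __init__(self):
          self.jobs = []                    # stands in for the production DB

      def requestJob(self, queue):
          """Hand the next queued job description to a pulling agent."""
          return self.jobs.pop(0) if self.jobs else False

      def setJobStatus(self, job_id, status, step=""):
          print(job_id, status, step)       # would update the monitoring DB
          return True

  server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
  server.register_instance(ProductionService())
  server.serve_forever()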

Agent customization at a production site

 Easy setup of a production site is crucial to absorb all available resources;
 One Python script where all the local configuration is defined (a hypothetical sketch follows):
   • interface to the local batch system;
   • interface to the local mass storage system.
 The agent distribution comes with examples of typical cases;
 A "standard" site (e.g. PBS + disk mass storage) can be configured in a few minutes.
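
The following is only a guess at what such a configuration script might define, assuming the agent expects batch-system and storage objects with methods like those used in the agent-cycle sketch above; the real DIRAC plugin interface is not shown in the slides.

  # Hypothetical local configuration for a "standard" PBS + disk storage site;
  # class and method names are assumptions about the plugin interface.
  import os
  import shutil
  import subprocess

  class PBSBatchSystem:
      def is_queue_available(self, queue):
          # e.g. check that the PBS queue exists and accepts jobs
          return subprocess.call(["qstat", "-Q", queue]) == 0

      def submit(self, script, queue):
          out = subprocess.run(["qsub", "-q", queue, script],
                               capture_output=True, text=True)
          return out.stdout.strip()         # PBS job identifier

  class DiskMassStorage:
      def __init__(self, base="/storage/lhcb"):
          self.base = base

      def store(self, path):
          dest = os.path.join(self.base, os.path.basename(path))
          shutil.copy(path, dest)
          return dest

  # Objects the agent would pick up from this configuration module.
  batch_system = PBSBatchSystem()
  mass_storage = DiskMassStorage()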

Dealing with failures

 A job is rescheduled if the local system fails to run it:
   • other sites can then pick it up.
 Journaling:
   • all the sensitive files (logs, bookkeeping, job descriptions) are kept in the production site caches.
 A job can be restarted from where it failed:
   • accomplished steps are not redone.
 File transfers are automatically retried after a predefined pause in case of failures (sketched below).
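
The transfer retry logic could be as simple as the following sketch; the command, the number of retries and the pause length are placeholders (the real transfers use bbftp).

  # Sketch only: retry a transfer command after a predefined pause on failure.
  import subprocess
  import time

  def transfer_with_retries(command, retries=3, pause=600):
      for _ in range(retries):
          if subprocess.call(command) == 0:
              return True
          time.sleep(pause)                 # predefined pause before retrying
      return False

  # transfer_with_retries(["bbftp", ...])  # actual bbftp arguments omitted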

Working experience

 The DIRAC production system was deployed at 17 LHCb production sites:
   • 2 hours to 2 days of work for the site customization.
 Smooth running for MC production tasks;
 Much less burden for local production managers:
   • automatic data upload to CERN/Castor;
   • log files automatically available through a Web page;
   • automatic recovery from common failures (job submission, data transfers).
 The current Data Challenge production using DIRAC is running ahead of schedule:
   • ~1000 CPUs used in total;
   • 1M events produced per day.

DIRAC on the DataGRID

[Deployment diagram: a DataGRID portal wraps the XML job description (job.xml) in JDL and submits it to the Resource Broker, which dispatches jobs to DataGRID worker nodes (WN); the agents on the worker nodes report to the central Production, Monitoring and Bookkeeping services, and data are uploaded via the Replica manager, registered in the Replica catalog, and stored at the CERN SE / Castor.]

Deploying agents on the DataGRID

INPUT:
 The JDL InputSandbox contains:
   • the job XML description;
   • the agent launcher script:
     > wget 'http://…/distribution/dmsetup'
     > dmsetup --local DataGRID
     > shoot_agent job.xml
OUTPUT:
 The EDG replica_manager is used for data transfer to the CERN SE/Castor;
 Log files are passed back via the OutputSandbox.

Tests on the DataGRID testbed

 Standard LHCb production jobs were used for the tests:
   • jobs of different statistics with an 8-step workflow.
 Jobs were submitted to 4 EDG testbed Resource Brokers:
   • keeping ~50 jobs per broker;
   • software installed for each job.

  Job type (hours)    Total   Success   Success rate
  Mini (0.2)            190       113            59%
  Short (6)             171       102            59%
  Medium (24)          1195       346            29%
  Total                1556       561            36%

A total of ~300K events has been produced so far. This already makes the EDG testbed a competitive LHCb production site.

Main problems

 EDG middleware instability:
   • MDS information system failures ("no matching resources found");
   • the RB fails to get input files because of gridftp failures;
   • jobs stuck in some unfinished state ("Done", "Resubmitted", etc.).
 Long jobs suffer from site misconfiguration:
   • the RB fails to find appropriate resources;
   • jobs hit the limits of the local batch system;
   • "Estimated Traversal Time" fails as a ranking criterion.
 Software installation failures:
   • disk quotas;
   • forbidden outbound IP connections on WNs at some sites.

Some lessons learnt

 An API for software installation is needed:
   • for experiments, to install software independently from site managers and on a per-job basis if necessary;
   • for site managers, to be sure the software is installed in an organized way.
 Outbound IP connectivity should be available:
   • needed for the software installation;
   • needed for jobs exchanging messages with the production services.
 Uniform site descriptions:
   • an EDG uniform CPU unit?

Conclusions

 The DIRAC production system is now routinely running at ~17 sites;
 The PULL paradigm for job scheduling proved to be very successful;
 It is of great help to local production managers and a key to the success of the LHCb Data Challenge 2003;
 The DataGRID testbed is integrated into the DIRAC production system; extensive tests are in progress.