DIRAC – LHCb MC production system

A. Tsaregorodtsev, CPPM, Marseille
For the LHCb Data Management team
CHEP, La Jolla, 25 March 2003

Outline
  • Introduction
  • DIRAC architecture
  • Implementation details
  • Deploying DIRAC on the DataGRID
  • Conclusions

What is it all about?
  • DIRAC: Distributed Infrastructure with Remote Agent Control
  • Distributed MC production system for LHCb
  • Automates most of the production tasks:
      • production task definition and steering;
      • software installation on production sites;
      • job scheduling and monitoring;
      • data transfers and bookkeeping.
  • Minimum participation of local production managers
  • PULL rather than PUSH concept for job scheduling

DIRAC architecture
  [Diagram: Production service, Monitoring service and Bookkeeping service; SW agents at Sites A, B, C and D; flows labelled "Get jobs", "Monitoring info" and "Bookkeeping data".]

Advantages of the PULL approach
  • Better use of resources:
      • no idle or forgotten CPU power;
      • natural load balancing: a more powerful center gets more work automatically.
  • Less burden on the central production service:
      • it deals only with production task definition and bookkeeping;
      • it does not need to care about particular production sites.
  • No direct access to local disks from the central service
  • Easy introduction of new sites into the production system:
      • no information on local sites is necessary at the central site.
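
The load-balancing claim of the PULL approach can be illustrated with a small simulation. This is only a sketch under stated assumptions, not DIRAC code: the names `ProductionService`, `SiteAgent`, `request_job` and `poll` are hypothetical, and every job is assumed to finish within one polling cycle so that all slots are free again at the next poll.

```python
from collections import deque


class ProductionService:
    """Hypothetical central service: it only hands out jobs when asked."""

    def __init__(self, job_ids):
        self._queue = deque(job_ids)

    def request_job(self):
        """Return the next waiting job id, or None if nothing is queued."""
        return self._queue.popleft() if self._queue else None


class SiteAgent:
    """Hypothetical site agent that pulls work when it has free slots."""

    def __init__(self, name, free_slots):
        self.name = name
        self.free_slots = free_slots
        self.jobs = []

    def poll(self, service):
        """Pull at most one job per free slot in this cycle."""
        for _ in range(self.free_slots):
            job = service.request_job()
            if job is None:
                break
            self.jobs.append(job)


def run_cycles(service, agents, cycles):
    """Let every agent poll once per cycle; return jobs taken per site."""
    for _ in range(cycles):
        for agent in agents:
            agent.poll(service)
    return {agent.name: len(agent.jobs) for agent in agents}


service = ProductionService(range(60))
agents = [SiteAgent("small_site", 1), SiteAgent("large_site", 5)]
counts = run_cycles(service, agents, cycles=10)
print(counts)  # {'small_site': 10, 'large_site': 50}
```

The central service holds no table of sites or capacities. The larger site ends up with five times the work simply because it asks five times as often, and a new site would join by starting to poll.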
Job description
  [Diagram: the production manager uses web-based editors to prepare a workflow description and a production run description, which are combined ("+") into XML job descriptions stored in the Production DB. Application labels shown in the diagram: Pythia v2, Gauss v5, Brunel v12, GenTag v7.]
  Production run description:
      • Event type
      • Application options
      • Number of events
      • Execution mode
      • Destination site
      • …

Agent operations
  [Sequence diagram. Participants: Production agent, batch system, Production service, SW distribution service, Monitoring service, Bookkeeping service, Mass Storage.]
  Calls, in order:
      isQueueAvalable()
      requestJob(queue)
      installPackage()
      submitJob(queue)
      setJobStatus(step 1)
      setJobStatus(step 2)
      …
      setJobStatus(step n)
      sendBookkeeping()
      sendFileToCastor()
      addReplica()

Implementation details
  • Central web services:
      • XML-RPC servers;
      • web-based editing and visualization;
      • ORACLE production and bookkeeping databases.
  • Agent: a set of collaborating Python classes:
      • Python 1.5.2, to be sure it is compatible with all the sites;
      • standard Python library XML-RPC client;
      • the agent runs as a daemon process or as a cron job on a production site;
      • easily extendable via plugins:
          • for new applications;
          • for new tools, e.g. file transport;
      • data and log files are transferred using bbftp.

Agent customization at a production site
  • Easy setup of a production site is crucial to absorb all available resources;
  • One Python script where all the local configuration is defined:
      • interface to the local batch system;
      • interface to the local mass storage system;
  • The agent distribution comes with examples of typical cases;
  • A "standard" site can be configured in a few minutes:
      • e.g., PBS + disk mass storage.

Dealing with failures
  • A job is rescheduled if the local system fails to run it:
      • other sites can then pick it up.
  • Journaling:
      • all the sensitive files (logs, bookkeeping, job descriptions) are kept in caches at the production site.
  • A job can be restarted from where it failed:
      • accomplished steps are not redone.
  • File transfers are automatically retried after a predefined pause in case of failure.
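
The restart and retry behaviour described under "Dealing with failures" can be sketched as follows. This is an illustration under assumptions, not the DIRAC agent code: the helper names `run_with_journal` and `retry`, the JSON journal file and the use of `OSError` to signal a failed transfer are all hypothetical choices made for the example.

```python
import json
import os
import tempfile
import time


def run_with_journal(steps, journal_path):
    """Run (name, action) steps in order, skipping steps already journaled."""
    done = []
    if os.path.exists(journal_path):
        with open(journal_path) as fh:
            done = json.load(fh)
    for name, action in steps:
        if name in done:
            continue  # accomplished steps are not redone
        action()  # may raise; earlier steps stay recorded in the journal
        done.append(name)
        with open(journal_path, "w") as fh:
            json.dump(done, fh)
    return done


def retry(action, attempts=3, pause=0.01):
    """Call action until it succeeds, pausing between failed attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(pause)  # the "predefined pause" (value is arbitrary)


executed = []
failures_left = {"count": 2}  # the simulated transfer fails twice


def simulate():
    executed.append("simulate")


def transfer():
    if failures_left["count"] > 0:
        failures_left["count"] -= 1
        raise OSError("transfer failed")
    executed.append("transfer")


journal = os.path.join(tempfile.mkdtemp(), "job_journal.json")

try:
    # First pass: "simulate" succeeds, the unprotected transfer fails.
    run_with_journal([("simulate", simulate), ("transfer", transfer)], journal)
except OSError:
    pass

# Restart: "simulate" is skipped and the transfer is retried after a pause.
steps = [("simulate", simulate), ("transfer", lambda: retry(transfer))]
result = run_with_journal(steps, journal)
print(result, executed)
```

Writing the journal after every completed step is what makes the restart cheap: a long simulation step is never repeated just because the final upload failed.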
Working experience
  • The DIRAC production system was deployed on 17 LHCb production sites:
      • 2 hours to 2 days of work for customization.
  • Smooth running for MC production tasks;
  • Much less burden for local production managers:
      • automatic data upload to CERN/Castor;
      • log files automatically available through a Web page;
      • automatic recoveries from common failures (job submission, data transfers);
  • The current Data Challenge production using DIRAC is ahead of schedule:
      • ~1000 CPUs used in total;
      • 1M events produced per day.

DIRAC on the DataGRID
  [Diagram labels: Bookkeeping service, Monitoring service, Production service, Castor, DataGRID portal, CERN SE, DataGRID, Resource Broker, Replica manager, Replica catalog, WN (three worker nodes), job.xml, JDL.]

Deploying agents on the DataGRID
  • INPUT: the JDL InputSandbox contains:
      • job XML description;
      • agent launcher script:
          > wget 'http://…/distribution/dmsetup'
          > dmsetup --local DataGRID
          > shoot_agent job.xml
  • OUTPUT:
      • EDG replica_manager is used for data transfer to CERN SE/Castor;
      • log files are passed back via the OutputSandbox.

Tests on the DataGRID testbed
  • Standard LHCb production jobs were used for the tests:
      • jobs of different statistics with an 8-step workflow.
  • Jobs submitted to 4 EDG testbed Resource Brokers:
      • keeping ~50 jobs per broker;
  • Software installed for each job;

  Job type (hours)   Total   Success   Success rate
  Mini (0.2)           190       113            59%
  Short (6)            171       102            59%
  Medium (24)         1195       346            29%
  Total               1556       561            36%

  • Total of ~300K events produced so far.
  • This makes the EDG testbed already a competitive LHCb production site.
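
The totals and success rates in the test table can be checked with a few lines of arithmetic. The job counts below are copied from the table; the variable and function names are only for illustration. Note that the Short row evaluates to about 59.6%, so the 59% on the slide looks truncated rather than rounded.

```python
# (job type, typical duration in hours, jobs submitted, jobs succeeded)
results = [
    ("Mini", 0.2, 190, 113),
    ("Short", 6, 171, 102),
    ("Medium", 24, 1195, 346),
]


def success_rate(total, success):
    """Percentage of submitted jobs that finished successfully."""
    return 100.0 * success / total


total_jobs = sum(row[2] for row in results)
total_success = sum(row[3] for row in results)

for name, hours, total, success in results:
    rate = success_rate(total, success)
    print(f"{name:<8}{total:>6}{success:>6}{rate:>7.1f}%")

overall = success_rate(total_jobs, total_success)
print(f"{'Total':<8}{total_jobs:>6}{total_success:>6}{overall:>7.1f}%")
```

The overall rate is dominated by the 24-hour Medium jobs, which make up more than three quarters of the submissions and have the lowest success rate.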
Main problems
  • EDG middleware instability problems:
      • MDS information system failures: "no matching resources found";
      • RB fails to get input files because of gridftp failures;
      • jobs stuck in some unfinished state:
          • "Done", "Resubmitted", etc.
  • Long jobs suffering from site misconfiguration:
      • RB fails to find appropriate resources;
      • jobs hit the limits of the local batch system;
      • "Estimated Traversal Time" fails as a ranking criterion;
  • Software installation failures:
      • disk quotas;
      • forbidden outbound IP connections on WNs at some sites.

Some lessons learnt
  • An API for software installation is needed:
      • for experiments, to install software:
          • independently from site managers;
          • on a per-job basis if necessary;
      • for site managers, to be sure the software is installed in an organized way.
  • Outbound IP connectivity should be available:
      • needed for the software installation;
      • needed for jobs exchanging messages with production services.
  • Uniform site descriptions:
      • EDG uniform CPU unit?

Conclusions
  • The DIRAC production system is now routinely running in production at ~17 sites;
  • The PULL paradigm for job scheduling proved to be very successful;
  • It is of great help for local production managers and a key to the success of the LHCb Data Challenge 2003;
  • The DataGRID testbed is integrated in the DIRAC production system; extensive tests are in progress.