Role of Tier-0, Tier-1 and Tier-2 Regional Centers during CMS DC04
D. Bonacorsi (CNAF-INFN Bologna, Italy) on behalf of the CMS Collaboration
CHEP'04, Interlaken (Sept 27th – Oct 1st, 2004) [ id498 ses9 tr5 ]

Outline
• Introductory overview on:
  – CMS Pre-Challenge Production (PCP)
  – CMS Data Challenge (DC04) ideas, strategies, key points (main focus on Regional Centers, RCs)
• Role of RCs in data distribution
  – description of the distinct scenarios deployed and tested in DC04
• Successes, failures, experience gained, issues raised
• Summary and conclusions

CMS PCP-DC04 overview
Validation of the CMS computing model on a sufficient number of Tier-0/1/2 sites: a large-scale test of the computing/analysis models.
• Pre-Challenge Production: PCP (Jul. 03 – Feb. 04)
  – generation, simulation and digitization of the data samples needed as input for the DC
  – PCP strategy: mainly non-grid productions, but also grid prototypes (CMS/LCG-0, LCG-1, Grid3)
  – ~70M Monte Carlo events produced (20M with Geant4), 750K jobs, 3500 KSI2000 months, 80 TB of data
• Data Challenge: DC04 (Mar. – Apr. 04)
  – reconstruction and analysis on CMS data sustained over 2 months at 5% of the LHC rate at full luminosity (25% of the start-up luminosity)
  – data distribution to Tier-1 and Tier-2 sites
  – DC strategy:
    • sustain a 25 Hz reconstruction rate in the Tier-0 farm
    • register data and metadata to a world-readable catalogue
    • transfer reconstructed data from Tier-0 to Tier-1 centers
    • analyze reconstructed data at the Tier-1/2's as they arrive
    • monitor and archive resource and process information
  – DC04 aimed at demonstrating the feasibility of the full chain (reconstruction, distribution, analysis)

Global DC04 layout and data distribution infrastructure
[Diagram: at the Tier-0, a fake on-line process feeds ORCA RECO jobs steered via the RefDB; reconstructed data go to Castor MSS and to the IB/GDB and Export Buffers (EB), are registered in the POOL RLS catalogue, and are handed off via the TMDB to the Tier-0 data distribution agents; Tier-1 agents pull data into Tier-1 MSS storage for ORCA analysis/grid jobs, and Tier-2 storage serves physicists running local ORCA jobs, all on top of LCG-2 services.]

DC04 key points and Regional Centers involvement
• Maximize reconstruction efficiency at the Tier-0
• Automatic registration and distribution of data (see also [ id162 ses7 tr4 ])
  – via a set of loosely coupled agents running at the Tier-1's
  – key role of the Transfer Management DB (TMDB) for inter-agent communication (a minimal agent sketch follows below)
• Support a (reasonable) variety of data transfer strategies (and MSS):
  – LCG-2 Replica Manager (CNAF and PIC T1's, with LCG-2 Castor-SE)
  – native SRM (FNAL T1, with dCache+Enstore)
  – SRB (RAL, IN2P3, GridKA T1's, with Castor, HPSS, ...)
  – this reflects into 3 distinct T0 → T1 distribution chains (see later)
• Use a single global file catalogue (accessible from all Tier-1's)
  – RLS used for data and metadata (POOL) by all transfer tools
• Redundant monitoring/archiving of information on resources and processes:
  – MonALISA: global monitoring of the network and all CPU resources
  – LEMON: dedicated monitoring of DC04 Tier-0 resources
  – GridICE: monitoring of all LCG resources
• Grant data access at the Tier-2's for "real-time data analysis"
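The agent/TMDB mechanics are not spelled out in these slides; as an illustration only, here is a minimal sketch of the loosely coupled agent pattern described above: a Tier-1 transfer agent polls the TMDB for files assigned to its site, copies them from the Export Buffer, and advances their state so that the other agents can react. The database layout (t_files table, state values), host names and the transfer command are assumptions for the sketch, not the actual DC04 implementation.

    # Minimal sketch of a Tier-1 "transfer agent" polling the TMDB for work.
    # The real DC04 agents were site-specific; the table/column names
    # (t_files, state, ...) and the transfer_file() helper are hypothetical.
    import time
    import subprocess
    import MySQLdb   # assumption: TMDB reachable as a SQL database

    SITE = "T1_Example"          # hypothetical site label
    POLL_INTERVAL = 60           # seconds between TMDB polls

    def transfer_file(source_pfn, dest_pfn):
        """Copy one file from the Tier-0 Export Buffer to local storage.
        Shown here via globus-url-copy; each chain used its own tool
        (Replica Manager, SRM, SRB)."""
        return subprocess.call(["globus-url-copy", source_pfn, dest_pfn]) == 0

    def main():
        db = MySQLdb.connect(host="tmdb.cern.ch", db="tmdb")  # hypothetical host
        while True:
            cur = db.cursor()
            # Ask the TMDB for files assigned to this site and not yet transferred.
            cur.execute(
                "SELECT guid, source_pfn, dest_pfn FROM t_files "
                "WHERE destination=%s AND state='available'", (SITE,))
            for guid, src, dst in cur.fetchall():
                ok = transfer_file(src, dst)
                # Advance the file state so downstream agents (MSS migration,
                # catalogue publication, EB clean-up) can pick it up.
                new_state = "transferred" if ok else "transfer_failed"
                cur.execute("UPDATE t_files SET state=%s WHERE guid=%s",
                            (new_state, guid))
            db.commit()
            time.sleep(POLL_INTERVAL)

    if __name__ == "__main__":
        main()

The point of the pattern is that the agents never talk to each other directly: the shared TMDB state machine is the only coupling, which is what allowed three quite different transfer chains to coexist.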
Hierarchy of RCs in DC04 and data distribution chains
Tier-0: CERN. Three distribution chains fan out to the Tier-1's and their associated Tier-2's:
• LCG-2 RM chain: CNAF (Italy), with Legnaro T2; PIC (Spain), with CIEMAT T2
• SRM chain: FNAL (USA), with UFL and Caltech T2's
• SRB chain: RAL (UK), GridKA (Germany), IN2P3 (France)

Tier-0
Architecture built on:
• Systems
  – LSF batch system: 3 dedicated racks, 44 nodes each (264 CPUs in total), dual P-IV Xeon 2.4 GHz, 1 GB memory, 100baseT; dedicated cmsdc04 batch queue, 500 RUN-slots
  – Disk servers: DC04-dedicated stager with 2 pools (IB and GDB), 10 + 4 TB
  – Export Buffers:
    • EB-SRM (4 servers, 4.2 TB total)
    • EB-SRB (4 servers, 4.2 TB total)
    • EB-SE (3 servers, 3.1 TB total)
• Databases
  – RLS (Replica Location Service)
  – TMDB (Transfer Management DB)
• Transfer steering
  – agents steering the data transfers run on a dedicated node (close monitoring)
• Monitoring services
[Diagram: Tier-0 components — RefDB, GDB and IB pools, Export Buffers, fake on-line process, Castor, POOL RLS catalogue, ORCA RECO jobs, TMDB and the Tier-0 data distribution agents.]

The LCG-2 chain (1/2)
• involved Tier-1's: CNAF and PIC (see also [ id497 ses9 tr5 ])
• Principle: data replication between LCG-2 SEs
• Set-up:
  – Tier-0: 1 EB, a classic disk-based LCG-2 SE (3 SE machines with 1 TB each), fed from CERN Castor by the RM data distribution agent
  – Tier-1's: a Castor-SE receiving the data, but with different underlying MSS hardware solutions; a Tier-1 agent replicates onward to Tier-2 disk-SEs
• Strategy comparison (a minimal sketch of the two approaches follows below):
  – CNAF: Replica Manager CLI (+ LRC C++ API for listing replicas only)
    • copies a file and inherently registers it in the RLS, with file-size info stored in the LRC
    • overhead introduced by the CLI Java processes
    • safer against failed replicas
  – PIC: globus-url-copy + LRC C++ API
    • copies a file and registers it in the RLS later, with no file-size check
    • faster, but no quality check of the replica operations

The LCG-2 chain (2/2)
• both the CNAF and PIC approaches achieved good performance
  – T1 agents were robust and kept pace with the data available at the EB
  – network 'stress-test' at the end of DC04 with 'big' files: typical transfer rates >30 MB/s, CNAF sustained >42 MB/s for some hours
[Network monitoring plots: eth I/O of the CERN Tier-0 SE-EB of the LCG chain, of the CNAF Tier-1 Castor-SE (~340 Mbps, >3k files, >750 GB) and of the CNAF Tier-1 classic disk-SE.]
• dealing with too many small files (a DC issue affecting all distribution chains) is "bad" for:
  – efficient use of bandwidth
  – scalability of MSS systems
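For illustration, a minimal sketch of the two LCG-2 chain strategies compared above: the CNAF-style single copy-and-register call versus the PIC-style plain GridFTP copy followed by a later catalogue registration. The Replica Manager command names and option letters are quoted from memory and may differ between LCG-2 releases; all LFNs, PFNs and hosts are hypothetical.

    # Sketch of the two LCG-2 chain strategies (CNAF vs PIC).
    # CLI names/options are given from memory and may not match the exact
    # LCG-2 release used in DC04; file names and hosts are hypothetical.
    import subprocess

    LFN        = "lfn:/cms/dc04/example_file.root"
    SOURCE_PFN = "gsiftp://eb-se.cern.ch/data/example_file.root"            # Tier-0 EB SE
    DEST_PFN   = "gsiftp://castor-se.example.org/cms/example_file.root"     # Tier-1 Castor-SE

    def replicate_cnaf_style():
        """CNAF approach: one Replica Manager call copies the file and
        registers the new replica (with its size) in the RLS in one step.
        Safer, but pays the Java start-up overhead on every call."""
        subprocess.check_call(
            ["edg-replica-manager", "copyAndRegisterFile",
             SOURCE_PFN, "-d", DEST_PFN, "-l", LFN])

    def replicate_pic_style():
        """PIC approach: plain GridFTP copy, then a separate registration of
        the new PFN in the RLS. Faster (no JVM), but no size/consistency check."""
        subprocess.check_call(["globus-url-copy", SOURCE_PFN, DEST_PFN])
        # In DC04 this registration used the LRC C++ API; the CLI call below
        # is only a stand-in for that API call.
        subprocess.check_call(
            ["edg-replica-manager", "registerFile", DEST_PFN, "-l", LFN])

The trade-off summarized in the slide follows directly from the two functions: the single-call route is atomic and size-checked but slow per file, the two-step route is fast but leaves replica quality unchecked.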
The SRM chain (1/2)
• involved Tier-1: FNAL (see [ id190 ses7 tr4 ])
• Principle: SRM transactions to obtain TURLs from the EB, transfers via GridFTP
• Set-up:
  – Tier-0: an SRM/dCache-based DRM serving as the EB; files are staged out of Castor to the dCache pool disk and pinned until transferred
  – Tier-1: an SRM/dCache/Enstore-based HRM acting as Import Buffer, with the SRM interface providing access to Enstore via dCache
[Diagram: the SRM client on the FNAL T1 agent machine issues SRM-COPY/SRM-GET requests (one file at a time, returning a TURL); the CERN T0 SRM performs stage and pin of the file on the EB dCache pool; the FNAL SRM performs space reservation and write on the local dCache pool backed by Enstore; the network transfer is a GridFTP GET in pull mode.]

The SRM chain (2/2)
• in general, quite robust tools
  – e.g. SRM for error checking/retrying, dCache for automatic migration to tape, ...
[Plot: number of transferred files vs. time, 1 Mar – 26 Apr 2004, rising to ~20,000 files.]
• stressed a few software/hardware components to the breaking point
  – e.g. monitoring was not implemented to catch service failures, forcing manual interventions
• again: problems from the high number and small size of DC files
  – use of multiple streams with multiple files per stream reduced the overhead of the authentication process
  – MSS optimization was necessary to handle the challenge load: inefficient use of tapes forced additional tape allocations plus deployment of a larger namespace service
• relevant improvements during ongoing DC operations
  – e.g. reducing the delegated proxy's modulus size in SRM sped up the interaction between SRM client and server by a factor 3.5

The SRB chain (1/2)
• involved Tier-1's: GridKA, IN2P3, RAL
• Principle: use SRB to transfer files to the local MSS with consistent catalogue info (a minimal Scommand sketch follows below)
• Set-up:
  – Tier-0: SRB EB; files are copied from Castor to the EB machine and then 'inserted' into the SRB virtual space (both data and metadata)
  – Tier-1's: one SRB IB at each site; data replication with SRB commands, i.e. Sreplicate or Sget/Sput
• GMCat component developed in the UK
  – links the SRB namespaces by periodically publishing SRB replica info into the RLS at CERN
• again: problems from the high number and small size of DC files
  – troublesome injection process of the initial entries onto the SRB EB at the T0
  – unexpected inefficiencies with SRB commands on small files
• reasonable T0 → T1 transfer rates
  – e.g. IN2P3 averaged ~30 Mbps and sustained 80 Mbps for some hours, mainly limited by the small file sizes

The SRB chain (2/2)
• while successful in PCP (which is why some sites chose it), SRB showed unexpectedly poor performance in DC04
• severely hampered by technical issues:
  – MCat single point of failure: unusability of the metadata catalogue at RAL, loss of performance, long-running queries causing transfer commands to time out, core dumps, ...
  – several annoyances in both the client and server software of SRB v.2 used in DC04: unreliable return codes from SRB commands, ongoing Sreplicate processes hard to kill cleanly, ...
• its use was stopped before the official end of DC04
  – because of the MCat problems, the T1's of the SRB chain did not take part in the large-file transfer test at the end of DC04
• in-depth investigation in progress
  – the most problematic items are being successfully addressed in SRB v.3
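As a rough illustration of the SRB chain mechanics described in "The SRB chain (1/2)" (inject a file into the SRB virtual space at the Tier-0 EB, then replicate it to a Tier-1 SRB resource), here is a minimal sketch; the Scommand option letters and the resource names are illustrative assumptions, not the actual DC04 configuration.

    # Sketch of the Tier-0 -> Tier-1 sequence of the SRB chain:
    # stage a file out of Castor, inject it into the SRB virtual space at the
    # Tier-0 Export Buffer, then replicate it to a Tier-1 SRB resource.
    # Option letters and resource names are illustrative, not verbatim DC04.
    import subprocess

    LOCAL_COPY  = "/data/eb/example_file.root"     # file staged out of Castor
    SRB_PATH    = "/cms/dc04/example_file.root"    # path in the SRB virtual space
    T0_RESOURCE = "cern-eb-resource"               # hypothetical SRB resources
    T1_RESOURCE = "ral-castor-resource"

    def run(cmd):
        print("running:", " ".join(cmd))
        subprocess.check_call(cmd)

    # 1. Inject the file (data + metadata) into SRB at the Tier-0 EB.
    run(["Sput", "-S", T0_RESOURCE, LOCAL_COPY, SRB_PATH])

    # 2. Replicate the SRB object to the Tier-1 resource; the Tier-1 MSS
    #    (Castor, HPSS, ...) sits behind that resource.
    run(["Sreplicate", "-S", T1_RESOURCE, SRB_PATH])

    # 3. A GMCat-like component would then periodically publish the new SRB
    #    replica locations into the RLS at CERN (not shown here).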
Tier-2's: real-time data analysis
• Tier-2's involved in DC04 (see also [ id136 ses9 tr5 ]):
  – CIEMAT, referring to the PIC T1 (LCG-2 chain)
  – Legnaro, referring to the CNAF T1 (LCG-2 chain)
  – UFL and Caltech, referring to the FNAL T1 (SRM chain)
• LCG-2 chain:
  – automatic procedures advertise to analysts that new data became available on the T1 and T2 disk-SEs (complete file sets were difficult to identify)
  – job submission is then automatically triggered via the Resource Broker (a minimal submission sketch is given at the end of this document)
  – job processes run at the site close to the data, access files via rfio, register output onto the RLS, ...
  – >15k jobs submitted via LCG-2 over about 2 weeks ran through the system
  – real-time data analysis at PIC measured a median delay of ~20 minutes between files being ready for distribution at the T0 and analysis jobs being submitted at the T1
• SRM chain:
  – the FNAL T1 deployed a MySQL POOL catalogue to enable access to the DC data transferred to the US
  – a few days of data access were attempted through dCache via a ROOT plug-in, allowing COBRA-based applications to access the data
  – software environment based on access to applications over AFS at CERN
  – the high number of small files made it logistically difficult to find the needed files: stored by date on tape, so many stages were required to complete a file set

An example: replica to disk-SEs
[Network monitoring plots for a single day (Apr 19th): eth I/O of the CNAF T1 Castor-SE (input from the SE-EB), of the CNAF T1 disk-SE (input from the Castor-SE) and of the Legnaro T2 disk-SE (input from the Castor-SE).]

Summary and Conclusions
The full chain is demonstrated to be feasible, but for a limited amount of time.
• Tier-0:
  – reconstruction/data-transfer/analysis may run at 25 Hz
  – 2200 running jobs/day (on ~500 CPUs), 4 MB/s produced and distributed to each Tier-1, 0.4 files/s registered to the RLS (with POOL metadata)
• Tier-1's:
  – different Tier-1 performances, related to operational choices
  – key items raised and addressed, with e.g. good overall performance of the LCG-2 chain (among others) throughout the DC
• main areas for future improvement have been identified:
  – reduce the number of files (i.e. increase <#events>/<#files>)
    • more efficient use of bandwidth
    • fixed "start-up" time dominates command execution times (e.g. Java in replica operations)
    • address scalability of MSS systems
  – better organize in advance, foreseeing what the real working scenarios will be
    • avoid working in an "always-reacting-to-something" mode
    • avoid conditions of "statistical debugging" on too many files in problematic states
• Real-time analysis at Tier-2's was demonstrated to be possible
  – the time window between reco data availability and the start of analysis jobs can be reasonably low
  – ... but a clean environment is needed

Full authors list
T. Barrass, S. Metson, Bristol University, United Kingdom
J. Andreeva, W. Jank, N. Sinanis, CERN, Switzerland
N. Colino, P. Garcia-Abia, J. M. Hernandez, F. J. Rodriguez-Calonge, CIEMAT, Madrid, Spain
M. Ernst, DESY, Germany
A. Anzar, L. Bauerdick, I. Fisk, R. Harris, Y. Wu, FNAL, Batavia, USA
G. Quast, K. Rabbertz, J. Rehn, Karlsruhe University, Germany
N. De Filippis, G. Donvito, G. Maggi, INFN-Bari, Italy
P. Capiluppi, A. Fanfani, C. Grandi, INFN-Bologna, Italy
D. Bonacorsi, A. Chierici, L. Dell'Agnello, G. Lo Re, B. Martelli, P. Ricci, F. Rosso, F. Ruggieri, INFN-CNAF, Italy
M. Biasotto, S. Fantinel, INFN-Legnaro, Italy
M. Corvo, F. Fanzago, M. Mazzucato, INFN-Padova, Italy
C. Charlot, P. Miné, I. Semeniouk, LLR-Ecole Polytechnique, CNRS & IN2P3, France
L. Tuura, Northeastern University, Boston, USA
M. Delfino, F. Martinez, G. Merino, A. Pacheco, M. Rodriguez, PIC, Barcelona, Spain
D. Stickland, T. Wildish, Princeton University, USA
D. Newbold, C. Shepherd-Themistocleous, RAL, United Kingdom
A. Nowack, RWTH Aachen, Germany
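Closing illustration, referenced from the "Tier-2's: real-time data analysis" slide: a minimal sketch of how an analysis job could be handed to the LCG-2 Resource Broker once a newly advertised file set is spotted. The JDL attribute names follow common LCG-2 usage as remembered here; the executable, the LFN and the triggering logic are hypothetical and not the actual DC04 scripts.

    # Sketch of the "new data -> submit analysis job" trigger of the LCG-2
    # chain: write a JDL whose InputData attribute lets the Resource Broker
    # schedule the job close to a storage element holding the files, which
    # the job then reads via rfio. Names below are illustrative only.
    import subprocess
    import tempfile

    def submit_analysis_job(lfns):
        input_data = ", ".join('"%s"' % lfn for lfn in lfns)
        jdl = """
    Executable         = "runOrcaAnalysis.sh";
    Arguments          = "DC04AnalysisConfig";
    StdOutput          = "analysis.out";
    StdError           = "analysis.err";
    InputSandbox       = {"runOrcaAnalysis.sh"};
    OutputSandbox      = {"analysis.out", "analysis.err"};
    InputData          = {%s};
    DataAccessProtocol = {"rfio"};
    """ % input_data

        with tempfile.NamedTemporaryFile(mode="w", suffix=".jdl",
                                         delete=False) as f:
            f.write(jdl)
            jdl_path = f.name
        # Hand the job to the LCG-2 Resource Broker.
        subprocess.check_call(["edg-job-submit", "--vo", "cms", jdl_path])

    # Example: submit once a (hypothetical) complete file set has been advertised.
    submit_analysis_job(["lfn:/cms/dc04/bt03_ttbb_example_file1.root"])

Specifying InputData and DataAccessProtocol is what steers the job towards a site close to the data, matching the "jobs run close to the data, access files via rfio" flow described above.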