
Role of Tier-0, Tier-1 and
Tier-2 Regional Centers
during CMS DC04
D. Bonacorsi (CNAF-INFN Bologna, Italy)
on behalf of the CMS Collaboration
Outline
• Introductory overview on:
  – CMS Pre-Challenge Production (PCP)
  – CMS Data Challenge (DC04)
• Ideas, strategies, key points (main focus on Regional Centers, RCs):
  – Role of RCs in the data distribution infrastructure
  – Description of the distinct scenarios deployed and tested in DC04
• Successes, failures, experience gained, issues raised
• Summary and conclusions
CHEP’04, Interlaken (Sept 27th – Oct 1st, 2004)
D.Bonacorsi (CNAF-INFN, Italy)
[ id498 ses9 tr5 ]
CMS PCP-DC04 overview
Validation of the CMS computing model on a sufficient number of Tier-0/1/2 sites
→ large-scale test of the computing and analysis models
• Pre-Challenge Production: PCP (Jul. 03 - Feb. 04)
– Simulation and digitization of data samples needed as input for DC
– PCP Strategy:
• mainly non-grid productions, but also grid prototypes
(CMS/LCG-0, LCG-1, Grid3)
[ Diagram: PCP workflow: Generation → Simulation → Digitization ]
– ~70M Monte Carlo events produced (20M with Geant4):
  750K jobs, 3500 KSI2000 months, 80 TB of data
• Data Challenge: DC04 (Mar. - Apr. 04)
– Reconstruction and analysis of CMS data sustained over 2 months
  at 5% of the LHC rate at full luminosity, i.e. ~25% of the start-up luminosity
– Data distribution to Tier-1,Tier-2 sites
– DC Strategy:
• sustain a 25 Hz reconstruction rate in the Tier-0 farm
• register data and metadata to a world-readable catalogue
• transfer reconstructed data from Tier-0 to Tier-1 centers
• analyze reconstructed data at the Tier-1/2's as they arrive
• monitor and archive resources and process information
[ Diagram: DC04 workflow: Reconstruction → Analysis ]
DC04 aimed at demonstrating the feasibility of the full chain
Global DC04 layout
and data distribution infrastructure
[ Diagram: the CERN Tier-0 (fake on-line process, RefDB, GDB and IB pools, ORCA RECO jobs, Castor MSS, Export Buffers, TMDB, POOL RLS catalogue, data distribution agents) feeds the Tier-1's through dedicated agents; each Tier-1 (agent, MSS, storage) runs ORCA local and Grid analysis jobs, and the Tier-2's (storage) host ORCA Grid analysis jobs submitted by physicists via the LCG-2 services ]
DC04 key points
and Regional Centers involvement
• Maximize reconstruction efficiency at the Tier-0
• Automatic registration and distribution of data                    see also [ id162 ses7 tr4 ]
  – via a set of loosely coupled agents running at the Tier-1's
  – key role of the Transfer Management DB (TMDB) for inter-agent communication
    (a minimal sketch of such an agent follows after this list)
• Support a (reasonable) variety of data transfer strategies (and MSS):
  – LCG-2 Replica Manager (CNAF, PIC T1's: with LCG-2 Castor-SE)
  – native SRM (FNAL T1: with dCache+Enstore)
  – SRB (RAL, IN2P3, GridKA T1's: with Castor, HPSS, ...)
  → this results in 3 distinct distribution chains T0 → T1's (see later)
• Use a single global file catalogue (accessible from all Tier-1's)
  – RLS used for data and metadata (POOL) by all transfer tools
• Redundant monitoring/archiving of info on resources and processes:
  – MonALISA global monitoring of the network and all CPU resources, LEMON dedicated
    monitoring of DC04 Tier-0 resources, GridICE monitoring of all LCG resources
• Grant data access at the Tier-2's for "real-time data analysis"
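
To make the agent/TMDB coupling concrete, here is a minimal sketch (Python, not CMS code) of a Tier-1 transfer agent polling a TMDB-like table and updating file states. The table schema, site name and transfer command are hypothetical placeholders; in DC04 the real TMDB was a central database shared by all agents.

# Minimal sketch of a loosely coupled DC04-style transfer agent.
# Schema, site name and transfer command are hypothetical placeholders.
import sqlite3, subprocess, time

DB, SITE = "tmdb_stub.db", "T1_Example"

def pending(conn):
    """Files assigned to this site and not yet transferred (hypothetical schema)."""
    return conn.execute(
        "SELECT guid, src, dst FROM transfer_queue "
        "WHERE destination=? AND state='assigned'", (SITE,)).fetchall()

def transfer(src, dst):
    """Stand-in for the site's transfer tool (Replica Manager, srmcp, SRB, ...)."""
    return subprocess.call(["echo", "copy", src, dst]) == 0

def mark(conn, guid, state):
    conn.execute("UPDATE transfer_queue SET state=? WHERE guid=?", (state, guid))
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(DB)
    conn.execute("CREATE TABLE IF NOT EXISTS transfer_queue "
                 "(guid TEXT, src TEXT, dst TEXT, destination TEXT, state TEXT)")
    while True:                       # agents poll the TMDB; no direct agent-to-agent calls
        for guid, src, dst in pending(conn):
            mark(conn, guid, "done" if transfer(src, dst) else "failed")
        time.sleep(60)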
Hierarchy of RCs in DC04
and data distribution chains
[ Diagram: CERN Tier-0 feeding the Tier-1's through three data distribution chains ]
• LCG-2 RM chain: CNAF (Italy), with Tier-2 Legnaro; PIC (Spain), with Tier-2 CIEMAT
• SRM chain: FNAL (USA), with Tier-2's UFL and Caltech
• SRB chain: RAL (UK), GridKA (Germany), IN2P3 (France)
Tier-0
Architecture built on:
• Systems
  – LSF batch system
    • 3 dedicated racks, 44 nodes each: 264 CPUs in total
    • Dual P-IV Xeon 2.4 GHz, 1 GB memory, 100baseT
    • dedicated cmsdc04 batch queue, 500 RUN-slots
  – Disk servers:
    • DC04-dedicated stager with 2 pools: IB and GDB (10 + 4 TB)
  – Export Buffers
    • EB-SRM ( 4 servers, 4.2 TB total )
    • EB-SRB ( 4 servers, 4.2 TB total )
    • EB-SE  ( 3 servers, 3.1 TB total )
• Databases
  – RLS (Replica Location Service)
  – TMDB (Transfer Management DB)
• Transfer steering
  – agents steering data transfers, on a dedicated node (close monitoring..)
• Monitoring services
[ Diagram: Tier-0 box: fake on-line process, RefDB, GDB/IB pools, ORCA RECO jobs, Castor, Export Buffers, TMDB, POOL RLS catalogue, data distribution agents ]
The LCG-2 chain (1/2)
• involved Tier-1’s: CNAF and PIC
• Principle: data replication between LCG-2 SEs                      see also [ id497 ses9 tr5 ]
• Set-up:
  – Tier-0: 1 EB, a classic disk-based LCG-2 SE (3 SE machines with 1 TB each)
  – Tier-1's: a Castor-SE receiving data, but with different underlying MSS hardware solutions
[ Diagram: CERN Tier-0 (Castor, RM data distribution agent, disk-SE EB, RLS, TMDB) → Tier-1 agent → Castor-SE at the Tier-1 → disk-SE at the Tier-2 ]
• Strategies comparison:
  – CNAF: Replica Manager CLI (+ LRC C++ API for listing replicas only)
    • copies a file and inherently registers it in the RLS, with file-size info stored in the LRC
    • overhead introduced by the CLI java processes
    • safer against failed replicas
  – PIC: globus-url-copy + LRC C++ API
    • copies a file and registers it to the RLS later, with no file-size check
    • faster!
    • no quality check of replica operations
  (a minimal sketch of the two approaches follows below)
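
As an illustration of the two strategies compared above, the sketch below contrasts a copy-and-register call through the Replica Manager CLI with a raw globus-url-copy followed by a separate catalogue registration. Command names and arguments are indicative only (LCG-2 era tools named in the text); the exact Replica Manager CLI syntax and the LRC registration hook are assumptions.

# Sketch contrasting the two Tier-1 transfer strategies described above.
import subprocess

REPLICA_MANAGER = ["edg-replica-manager", "copyAndRegisterFile"]   # indicative CLI, not verified

def cnaf_style(source_surl, dest_surl):
    """Copy + register in one step: slower (java start-up per call),
    but the replica is registered only if the copy succeeded."""
    return subprocess.call(REPLICA_MANAGER + [source_surl, "-d", dest_surl]) == 0

def pic_style(source_gsiftp, dest_gsiftp, register_replica):
    """Raw gridFTP copy, then register the replica afterwards:
    faster, but with no built-in quality check of the replica."""
    ok = subprocess.call(["globus-url-copy", source_gsiftp, dest_gsiftp]) == 0
    if ok:
        register_replica(dest_gsiftp)   # e.g. a thin wrapper around the LRC C++ API (assumed)
    return ok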
The LCG-2 chain (2/2)
• both the CNAF and PIC approaches achieved good performance
  – T1 agents were robust and kept pace with the data available at the EB
  – network 'stress-test' at the end of DC04 with 'big' files:
    typical transfer rates >30 MB/s, CNAF sustained >42 MB/s for some hours
[ Network monitoring plots (eth I/O): CERN Tier-0 SE-EB of the LCG chain; CNAF Tier-1 Castor-SE (~340 Mbps, >3k files, >750 GB); CNAF Tier-1 classic disk-SE ]
• dealing with too many small files (a DC issue affecting all distribution chains), "bad" for:
  – efficient use of bandwidth
  – scalability of MSS systems
The SRM chain (1/2)
• involved Tier-1: FNAL
• Principle: SRM transactions to receive TURLs from the EB, transfers via gridFTP
  see [ id190 ses7 tr4 ]
• Set-up:
  – Tier-0: SRM/dCache-based DRM serving as an EB
    • files are staged out of Castor to the dCache pool disk and pinned until transferred
  – Tier-1: SRM/dCache/Enstore-based HRM, acting as an Import Buffer with an SRM interface
    providing access to Enstore via dCache
[ Diagram: the SRM client on the Tier-1 agent machine drives an SRM-COPY; for each file an SRM-GET returns a TURL from the CERN T0 SRM (which stages the file from Castor into the dCache pool EB and pins it); the file is then pulled over the network via a gridFTP GET by the FNAL T1 SRM, which performs space reservation and the write into its dCache pool, backed by Enstore ]
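
A compact sketch of the pull-mode loop in the diagram above: the SRM-GET step is represented by a caller-supplied function (a stand-in for a real SRM client such as the FNAL srmcp tooling), and the gridFTP pull uses globus-url-copy. Names and signatures here are illustrative assumptions.

# Pull-mode transfer loop for the SRM chain sketched above.
import subprocess

def pull_files(surls, local_dir, srm_get_turl):
    """srm_get_turl: stand-in for the SRM-GET call; the source SRM stages the
    file, pins it, and returns a gridFTP TURL (assumed interface)."""
    for surl in surls:                              # one file at a time, as in DC04
        turl = srm_get_turl(surl)                   # SRM-GET step
        dest = "file://" + local_dir + "/" + surl.rsplit("/", 1)[-1]
        # gridFTP GET in pull mode; globus-url-copy is the standard command-line client
        subprocess.check_call(["globus-url-copy", turl, dest])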
The SRM chain (2/2)
• in general quite robust tools
  – e.g. SRM error checking/retrying, dCache automatic migration to tape, ...
[ Plot: number of transferred files vs. time, 1-Mar-2004 to 26-Apr-2004 (scale up to ~20000 files) ]
• stressed a few software/hardware components to the breaking point
  – e.g. monitoring not implemented to catch service failures, forcing manual interventions
• again: problems with the high number / small size of DC files
  – use of multiple streams, with multiple files in each stream,
    reduced the overhead of the authentication process
  – MSS optimization necessary to handle the challenge load
    • inefficient use of tapes forced more tape allocations + deployment of a larger namespace service
  – relevant improvements during ongoing DC operations
    • e.g. reduction of the delegated proxy's modulus size in SRM yielded a factor ~3.5 speed-up
      of the interaction between SRM client and server
The SRB chain (1/2)
• involved Tier-1’s: GridKA, IN2P3, RAL
• Principle: use SRB to transfer files to the local MSS with consistent catalog info
• Set-up:
  – Tier-0: SRB EB
    • files copied from Castor to the EB machine, then 'inserted' into the SRB virtual space
      (both data and metadata)
  – Tier-1's: one SRB IB at each site
    • data replication with SRB commands, i.e. Sreplicate or Sget/Sput (see the sketch after this list)
• GMCat component developed in the UK
  – links the SRB namespaces by periodically publishing SRB replica info into the RLC at CERN
• again: problems with the high number / small size of DC files
  – troublesome injection process of the initial entries onto the SRB EB at the T0
  – unexpected inefficiencies with SRB commands on small files
• reasonable T0 → T1 transfer rates
  – e.g. IN2P3 averaged ~30 Mbps, and sustained 80 Mbps for some hours
  – mainly limited by the small file sizes
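
A minimal sketch of the replication step named above, driving the SRB Scommands (Sreplicate, Sget, Sput) from Python. Argument order and options are indicative only; the exact syntax is that of the SRB v.2/v.3 documentation.

# Sketch of the SRB-chain replication step, via the Scommands.
import subprocess

def replicate(srb_object, target_resource):
    """Replicate an object already in SRB space to a Tier-1 resource (indicative syntax)."""
    return subprocess.call(["Sreplicate", srb_object, target_resource]) == 0

def get_then_put(srb_source, local_tmp, srb_dest):
    """Alternative named in the text: pull the file with Sget, then push it with Sput."""
    return (subprocess.call(["Sget", srb_source, local_tmp]) == 0 and
            subprocess.call(["Sput", local_tmp, srb_dest]) == 0)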
The SRB chain (2/2)
• while successful in PCP (that’s why some sites chose it),
SRB showed unexpected poor performance in DC04
• severely hampered by technical issues:
 MCat single point of failure: unusability of metadata catalogue at RAL
 Loss of performance, long time queries causing transfer commands to timeout, core-dumps..
 several annoyances in both client/server sw of SRB v.2 used in DC04
 SRB commands return code not reliable, hard to cleanly kill on-going Sreplicate processes, ...
[ Plot: GridKA Tier-1 monitoring ]
• its use was stopped before the official end of DC04 (MCat problems..)
  – the T1's of the SRB chain did not take part in the large-file transfer test at the end of DC04
• in-depth investigation in progress
  – the most problematic items are successfully being addressed in SRB v.3
Tier-2’s: real-time data analysis
• Tier-2's involved in DC04:                                         see also [ id136 ses9 tr5 ]
  – CIEMAT (referring to PIC T1) and Legnaro (referring to CNAF T1): LCG-2 chain
  – UFL and Caltech (referring to FNAL T1): SRM chain
• LCG-2 chain:
  – automatic procedures notify analysts that new data have become available
    on the T1 and T2 disk-SEs...
    • difficult to identify complete file sets..
  – ... then job submission is automatically triggered, via the Resource Broker
    (a minimal sketch of this trigger loop follows after this list)
    • job processes run at a site close to the data, access files via rfio, register output onto the RLS, ...
  – >15k jobs submitted via LCG-2 over about 2 weeks ran through the system
  – real-time data analysis at PIC measured a median delay of ~20 minutes between
    files being ready for distribution at the T0 and analysis jobs being submitted at the T1
• SRM chain:
  – FNAL T1 deployed a MySQL POOL catalogue to enable access to the DC data transferred to the US
  – a few days of data access were attempted through dCache via a ROOT plug-in, allowing
    COBRA-based applications to access the data
    • software environment based on access to applications over AFS at CERN
    • high number of small files
      – logistically difficult to find the needed files: stored by date on tape, thus many stages
        required to complete a file set
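
The trigger loop referenced in the LCG-2 chain item above could look like the following sketch: a watcher polls for newly registered replicas, groups them into file sets, and submits an analysis job through the Resource Broker (edg-job-submit with a JDL file was the LCG-2 era submission interface). The catalogue query, grouping logic and JDL template are hypothetical placeholders.

# Sketch of the real-time analysis trigger in the LCG-2 chain.
import subprocess, time

def new_filesets(seen, list_new_replicas):
    """Group freshly registered replicas into file sets (naive placeholder logic)."""
    fresh = [r for r in list_new_replicas() if r not in seen]
    seen.update(fresh)
    return [fresh] if fresh else []

def submit_analysis(fileset, jdl="analysis.jdl"):
    """One job per file set; the JDL template is a hypothetical stand-in."""
    subprocess.check_call(["edg-job-submit", jdl])

def watch(list_new_replicas, period=60):
    seen = set()
    while True:                          # poll the catalogue, submit as data arrives
        for fs in new_filesets(seen, list_new_replicas):
            submit_analysis(fs)
        time.sleep(period)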
An example:
Replica to disk-SEs
Just one day: Apr 19th
[ Network monitoring plots (eth I/O): CNAF T1 Castor-SE, input from the SE-EB; CNAF T1 disk-SE, input from the Castor-SE; Legnaro T2 disk-SE, input from the Castor-SE ]
Summary and Conclusions
The full chain was demonstrated to be feasible, but for a limited amount of time
• Tier-0:
  – reconstruction / data-transfer / analysis may run at 25 Hz
  – 2200 running jobs/day (on ~500 CPU's), 4 MB/s produced and
    distributed to each Tier-1, 0.4 files/s registered to RLS (with POOL metadata)
• Tier-1's: different Tier-1 performances, related to operational choices
  – key items raised and addressed
  – but e.g. good overall performance of the LCG-2 chain (among others) throughout the DC
• main areas for future improvement have been identified:
  – reduce the number of files (i.e. increase <#events>/<#files>)
    • more efficient use of bandwidth
    • fixed "start-up" time dominates command execution times (e.g. java in replicas..)
    • address scalability of MSS systems
  – better organize in advance, foresee what the real working scenarios will be
    • avoid working in an "always-reacting-to-something" mode..
    • avoid conditions of "statistical debugging" on too many files in problematic states..
• Real-time analysis at Tier-2's was demonstrated to be possible
  – the time window between reco data availability and the start of analysis jobs can be
    reasonably low
  – ... but it needs a clean environment.
  (a back-of-the-envelope check of the quoted Tier-0 rates follows below)
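
As a back-of-the-envelope check of the Tier-0 rates quoted above (assuming they are simultaneous averages, which the slide does not state explicitly), the events-per-file and per-event data volume follow directly, and support the "increase <#events>/<#files>" recommendation:

# Back-of-the-envelope check of the quoted DC04 Tier-0 rates.
reco_rate_hz = 25     # events/s reconstructed at the Tier-0
file_rate_hz = 0.4    # files/s registered to the RLS
mb_per_s_t1  = 4.0    # MB/s produced and distributed to each Tier-1

print(reco_rate_hz / file_rate_hz)        # ~62 events per registered file
print(1000 * mb_per_s_t1 / reco_rate_hz)  # ~160 kB of reconstructed data per event, per Tier-1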
Full authors list
T. Barras, S. Metson, Bristol University, United Kingdom
J. Andreeva, W. Jank, N. Sinanis, CERN, Switzerland
N. Colino, P. Garcia-Abia, J. M. Hernandez, F. J. Rodriguez-Calonge, CIEMAT, Madrid, Spain
M. Ernst, DESY, Germany
A. Anzar, L. Bauerdick, I. Fisk, R. Harris, Y. Wu, FNAL, Batavia, USA
G. Quast, K. Rabbertz, J. Rehn, Karlsruhe University, Germany
N. De Filippis, G. Donvito, G. Maggi, INFN-Bari, Italy
P. Capiluppi, A. Fanfani, C. Grandi, INFN-Bologna, Italy
D. Bonacorsi, A.Chierici, L. Dell’Agnello, G. LoRe, B. Martelli, P. Ricci,
F. Rosso, F. Ruggieri, INFN-CNAF, Italy
M. Biasotto, S. Fantinel, INFN-Legnaro, Italy
M. Corvo, F. Fanzago, M. Mazzucato, INFN-Padova, Italy
C.Charlot, P.Mine', I.Semeniouk, LLR-Ecole Polytechnique, CNRS&IN2P3, France
L. Tuura, Northeastern University, Boston, USA
M. Delfino, F. Martinez, G. Merino, A. Pacheco, M. Rodriguez, PIC, Barcelona, Spain
D. Stickland, T. Wildish, Princeton University, USA
D. Newbold, C. Shepherd-Themistocleous, RAL, United Kingdom
A. Nowack, RWTH Aachen, Germany