Status of the ALICE Grid sites
ALICE Offline Week, CERN, 29 October 2009
Patricia Méndez Lorenzo (CERN, IT/GS-EIS)

Outlook
- ALICE is already in data-taking mode. This requires maximal automation and control of the local services by the site admins.
- Follow-up of the (local) WLCG services cannot be under the control of the central Grid team at CERN: 101 VOBOXes distributed over 83 sites require the active involvement of site admins and regional experts.
- We need a clear statement from each site in terms of:
  1. Available resources and setup
  2. Available workload management system
  3. Management of the local VOBOXes and backups
  4. Recovery procedures and protocols
- These points are the topic of this talk.

WNs in SL(C)5
- Requirement: deployment of SL5 on all WNs (T0, T1 and T2 sites). Deadline: mid-September 2009.
- Service status: the SL5 WN was deployed in production by 23 March 2009. On 4 August 2009 the WLCG MB specifically promoted the deployment of SL5 WNs at all sites.
- ALICE has been asking for this deployment at all sites since summer 2009, as restated in the ALICE presentation at the September WLCG GDB meeting.
- ALICE sites status: we are very far from the expectations presented in September.

Status of the sites we know about (I)
(The deployments of SL5 and of the CREAM-CE at the sites are related.)
- CERN (T0): SLC4 and SLC5 available; ALICE is currently using SLC5 resources only.
- RAL (T1): used together with CERN to test the ALICE software on SL5; mostly migrated to SL5.
- CNAF (T1): testing of SL5 ongoing; most of the resources for LHC are still on SL4 (mixed situation).
- CCIN2P3 (T1): site configuration upgraded yesterday to increase the number of available SL5 resources.
- FZK (T1): migration to SL5 finished.

Status of the sites we know about (II)
- Madrid (T2): migrated to SL5.
- Trujillo (T2): migrated to SL5.
- Hiroshima (T2): beginning the migration of the whole site to SL5.
- KISTI (T2): mixed situation, both types of WNs available.
- Subatech (T2): mixed situation, both types of WNs available.

Status of the sites we know about (III)
- ITEP (T2): migrated to SL5.
- SPbSU (T2): migrated to SL5.
- Kolkata (T2): mixed situation, SL4 and SL5 available.
- IPNL (T2): migrated to SL5.
- IHEP (T2): migrated to SL5.
Status of the sites we know about (IV)
- RRC-KI (T2): migrated to SL5.
- GRIF_DAPNIA (T2): migrated to SL5.
- JINR (T2): migrated to SL5.
- Troitsk (T2): migrated to SL5.

The gLite3.2 VOBOX

Evolution of the service
- A « pre-PPS » version was announced at the beginning of October: patch #3205 (SL5 VOBOX) and patch #3040 (WMS UI fixes), installed at CERN for testing purposes.
- Instructions were provided by IT-GD and distributed to the sites through the ALICE TF, specifying that this is a testing patch meant to gain familiarity with the system and its installation.
- New rpm issues stopped the deployment of the patch for almost 4 weeks; the ALICE sites have been kept informed of every advance and issue at all times.
- Two more sites had already migrated the VOBOX: KISTI and ITEP.
- The patch has been in production since 13 October 2009, but only 12 sites are providing this new version today.

The gLite3.2 VOBOX: issues
- Thanks to the availability of the gLite3.2 VOBOXes we were able to find a compatibility problem in the libraries: the libraries needed by the new VOBOX for its normal operations (proxy renewal, etc.) conflicted with those provided by AliEn.
- Status: SOLVED, by changing the environment setup before and after any VOBOX-specific operation (see the sketch below).
- Further issues: NONE. The current gLite3.2 VOBOX should not present any further problem for the experiment or the site admins; installation and configuration are very easy.
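The workaround above amounts to switching the library environment around each VOBOX-specific call and restoring it afterwards. The following minimal Python sketch illustrates the idea only: the library paths and the commented-out vobox-proxy invocation are assumptions for illustration, not the actual ALICE wrapper.

    import os
    import subprocess
    from contextlib import contextmanager

    # Assumed library locations: adjust to the local gLite and AliEn installations.
    GLITE_LIBS = "/opt/glite/lib64:/opt/glite/lib"

    @contextmanager
    def glite_environment():
        """Put the gLite libraries first in LD_LIBRARY_PATH for the duration of a
        VOBOX-specific operation, then restore the original (AliEn) environment."""
        saved = os.environ.get("LD_LIBRARY_PATH", "")
        os.environ["LD_LIBRARY_PATH"] = GLITE_LIBS + (":" + saved if saved else "")
        try:
            yield
        finally:
            os.environ["LD_LIBRARY_PATH"] = saved

    def run_vobox_operation(cmd):
        """Run a VOBOX command (e.g. a proxy-renewal client) with the gLite
        libraries, without leaving them in the AliEn environment afterwards."""
        with glite_environment():
            return subprocess.call(cmd)

    # Example (command line is illustrative only):
    # run_vobox_operation(["/opt/glite/bin/vobox-proxy", "query"])

The point of wrapping the call in a context manager is that the restore step cannot be forgotten, which is exactly what the "before and after" part of the workaround requires.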
Status of the SL5 VOBOX at the sites (I)
- CERN (T0): in production for more than one month.
- CNAF (T1): available in production.
- FZK (T1): registration of the new VOBOX DN required.
- RAL (T1): hardware already allocated, installation ongoing.

Status of the SL5 VOBOX at the sites (II)
- Subatech (T2): using the PPS installation.
- POZNAN (T2), ITEP (T2), Troitsk (T2), KISTI (T2), SPbSU (T2), Kolkata (T2): deployed.
- JINR (T2): registration of the new DN required.
- Hiroshima (T2): installation ongoing.

Hybrid situation
- The hybrid (worse) situation: WNs on SL4/32 bit and SL5/64 bit. This is the situation ALICE wanted to avoid in September.
- The experiment can manage this situation if and only if:
  - 2 VOBOXes are provided, each one running an independent PackMan service (and this is independent of CREAM!);
  - 2 different software areas (per VOBOX, per cluster) are provided, since the software version changes with the architecture.
- The support infrastructure has to be doubled.

Control and management of the VOBOXes

The VOBOX-T0 project
- The VOBOX-T0 project started before summer 2009 for the 4 LHC experiments.
- Goals: track the specific experiment services running on all T0 VOBOXes, and provide the operators with a standard protocol to recover those services in case of problems.
- The monitoring system used by the operators at CERN is LEMON. Although the VOBOXes do not have piquet (on-call) coverage, we can ensure a first level of support with simple procedures.
- The first complete prototype is now ready for ALICE and available on voalice06 at CERN.

Application to the ALICE VOBOXes
- A test suite has been created, based on some of the critical tests executed in our SAM test suite.
- The test suite runs with root privileges via a cron job.
- The test suite includes (a minimal sketch of such checks is given after this section):
  - status of the proxy-renewal daemon;
  - proper registration of the VOBOX in the myproxy server;
  - status of the gsisshd daemon.

Procedures for the ALICE VOBOXes
- If one of the previous tests fails, the LEMON system locally installed on the VOBOX triggers a specific alarm, visible to the operator at the Computer Centre.
- We have created a list of recovery procedures and actions, linked from the page used by the operator. A similar procedure will be applied to all experiments.
- Advantages:
  - The good behaviour of the gLite VOBOX service at CERN is controlled on a 24/7 basis.
  - In case of problems the system can be recovered transparently to the experiment.
  - A list of experts is also available to the operator.

Control of VOBOXes at the sites
- The project is intended for the T0, but the idea is to export it in the future to the T1 sites. Any Tx site interested in similar procedures can contact us to adapt them to its specific setup.
- Similar recovery procedures (or indeed any) should exist at every site, according to the level of support agreed with the site.
- Site admins should react before being notified by ALICE. Remember that VOBOXes are critical services for ALICE.
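As referenced above, the three checks of the test suite can be sketched as a small cron-driven script. This is a minimal sketch only: the daemon process names, the myproxy server and account, and the idea of signalling failures through a non-zero exit code (for a LEMON sensor to pick up) are assumptions, not the actual ALICE test suite.

    #!/usr/bin/env python
    """Sketch of a VOBOX health check run as root from cron.
    Process names, the myproxy invocation and the failure signalling are assumed."""
    import os
    import subprocess
    import sys

    DEVNULL = open(os.devnull, "w")

    def process_running(pattern):
        """True if a process matching 'pattern' is found (uses pgrep)."""
        return subprocess.call(["pgrep", "-f", pattern],
                               stdout=DEVNULL, stderr=DEVNULL) == 0

    def myproxy_registered(server, account):
        """True if a credential for the VOBOX is present on the myproxy server."""
        return subprocess.call(["myproxy-info", "-s", server, "-l", account],
                               stdout=DEVNULL, stderr=DEVNULL) == 0

    def main():
        failures = []
        if not process_running("glite-proxy-renewd"):    # proxy-renewal daemon (assumed name)
            failures.append("proxy-renewal daemon not running")
        if not process_running("gsisshd"):               # grid-enabled sshd
            failures.append("gsisshd not running")
        if not myproxy_registered("myproxy.cern.ch", "alice-vobox"):  # server/account are placeholders
            failures.append("VOBOX not registered in the myproxy server")
        if failures:
            sys.stderr.write("\n".join(failures) + "\n")
            sys.exit(1)                                   # non-zero exit: something to alarm on

    if __name__ == "__main__":
        main()

How the failure is turned into a LEMON alarm (exit code, flag file or metric) is a site-level choice; the checks themselves are the part described in the slides.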
CREAM-CE

Status of the CREAM-CE
- Requirement (ALICE presentation at the September WLCG GDB meeting): deployment of the CREAM-CE (T0, T1 and T2 sites) in parallel to the LCG-CE. Deadline: mid-November 2009.
- Service status: the CREAM1.5 version has been deployed in production since 6 October 2009.
- Last WLCG MB (27 October 2009): "WLCG sites should now deploy a CREAM CE service in parallel with the production LCG-CEs in order to gain real production experience. Although the ability to submit from Condor clients is still missing, the site installation will not change when this is available. The CREAM CEs should also now be marked as 'production' in the information system."
- In any case, the parallel deployment of the CREAM-CE alongside the LCG-CE had already been requested by the November 2008 GDB.
- ALICE sites status: very far from the expectations presented in September.

Status of the CREAM-CE at the sites
- Sites providing a CREAM-CE for ALICE production: CERN, KISTI, INFN-Torino, CNAF, RAL, FZK, Kolkata, IHEP, SARA, Subatech, Legnaro, SPbSU, Prague. This is the same situation presented in September 2009.
- For the mentioned sites, ALICE is running in stable production mode.
- Troitsk: installation done and in the testing phase.
- Situation of CC-IN2P3:
  - CREAM integration with BQS (the local batch system) is done, and the ALICE local support is testing the system.
  - Next step: deployment of the system in production for ALICE testing. We are waiting for the green light from the site admins.

CREAM-CE issues in detail (I)
- CREAM1.5 was deployed in production on 6 October (patch #3259 for SLC4/i386).
- Patch details: https://savannah.cern.ch/patch/?func=detailitem&item_id=3259#options
- Important bug fixes for the ALICE production are included in this version, together with fixes for two vulnerability reports associated with security issues for the sites.
- Migration of the sites to CREAM1.5 is highly encouraged, and ALICE fully supports it for all sites providing this service to the experiment.

CREAM-CE issues in detail (II): purge issues
- ALICE report: wrong reporting of the job status; CREAM's view of the running jobs becomes de-synchronized.
- ALICE requirement: a method to purge jobs in a non-terminal status.
- CREAM status:
  - The CREAM job status can be wrongly reported because of some misconfigurations or because of two bugs in the BLAH Blparser (solution status: integration candidates for CREAM1.6).
  - A specific bug covers the ALICE requirement, #55420: « Allow admin to purge CREAM jobs in a non terminal status » (solution status: in progress).
- Important note: the script provided to the ALICE site admins is in fact the bug fix. As soon as the relevant patch is released, the script will be part of the CREAM rpm.
- Current risk for ALICE: low, now that the developers have provided the site admins with the corresponding purge script (very high before).

CREAM-CE issues in detail (III): disk-space issues (reported by Subatech) — areas to monitor and purge or clean
- ALICE report: the local MySQL DB grew to 2.5 GB.
  - CREAM status: this is an issue of the MySQL engine. When entries are deleted from the DB the corresponding disk space is not released (so the CREAM DB does not shrink), but the space is reused when new data are added to the DB.
  - Current risk for ALICE: low.
- ALICE report: purging of the input sandboxes in /opt/glite/var/cream_sandbox (a stop-gap cleanup sketch is given below).
  - CREAM status: solved in CREAM1.5.
  - Risk for ALICE: none once the sites upgrade to CREAM1.5.
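For sites that have not yet upgraded to CREAM1.5, a simple cron-driven cleanup of old sandbox directories is one possible stop-gap. The sketch below is not an official CREAM procedure: the 30-day retention, the two-level directory layout and the assumption that old, unmodified directories belong to finished jobs are all illustrative assumptions, and it should only ever be run in dry-run mode first.

    #!/usr/bin/env python
    """Sketch of a stop-gap cleanup of old CREAM input sandboxes.
    Not an official CREAM tool: retention period and directory layout are assumed."""
    import os
    import shutil
    import time

    SANDBOX_ROOT = "/opt/glite/var/cream_sandbox"   # path quoted in the ALICE report
    MAX_AGE_DAYS = 30                                # assumed retention period

    def purge_old_sandboxes(root=SANDBOX_ROOT, max_age_days=MAX_AGE_DAYS, dry_run=True):
        cutoff = time.time() - max_age_days * 86400
        for user_dir in sorted(os.listdir(root)):            # assumed layout: <root>/<user>/<job>
            user_path = os.path.join(root, user_dir)
            if not os.path.isdir(user_path):
                continue
            for job_dir in sorted(os.listdir(user_path)):
                job_path = os.path.join(user_path, job_dir)
                if os.path.isdir(job_path) and os.path.getmtime(job_path) < cutoff:
                    print("%s %s" % ("would remove" if dry_run else "removing", job_path))
                    if not dry_run:
                        shutil.rmtree(job_path, ignore_errors=True)

    if __name__ == "__main__":
        purge_old_sandboxes(dry_run=True)   # switch to dry_run=False only after checking the output

Upgrading to CREAM1.5, which fixes the sandbox purging itself, remains the real solution.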
CREAM-CE issues in detail (IV): disk-space issues (cont.)
- ALICE report: issues regarding /opt/glite/var/log and /var/log.
  - ALICE requirement: a cleaning policy is required for these files, otherwise they can grow forever.
  - CREAM status: policies exist for all these files and can be customized file by file.
  - Risk for ALICE: low, since the size is manageable by the site admins.
- ALICE report: issues regarding /opt/glite/var/cream/user_proxy.
  - CREAM status: bug reported and accepted; the fix is not available in CREAM1.5.
  - Risk for ALICE: waiting for feedback from the site admins.

CREAM-CE issues in detail (V): load issues (reported by Subatech)
- ALICE report: the UNIX load goes up to 5 (during start-up or at high submission rates).
  - CREAM status: the problem was reported by GRNET; its origin was a missing index in the CREAM DB, solved in CREAM1.5.
  - Risk for ALICE: low once the CREAM version is upgraded.
- ALICE report: when Tomcat is restarted, the system can take up to 15 minutes before submitting new jobs.
  - CREAM status: the slow start of CREAM is also due to the problems coming from jobs reported in a wrong status; the fix is not included in CREAM1.5 but will be released in CREAM1.6.
  - Risk for ALICE: purge actions should speed up this start-up and therefore decrease the risk for the experiment.

CREAM-CE: conclusions
- All reported issues are either solved in CREAM1.5, or known by the developers, who are working on solutions to be included in CREAM1.6, or covered by workarounds the developers have provided to ALICE.
- In addition, the support of the CREAM-CE developers, and in particular of Massimo Sgaravatto, is exceptional.
- Finally, the WLCG MB is clearly promoting the deployment of this service at all sites.
- So the question is clear: why are sites not providing a CREAM-CE?

gLite3.2 WMS

Status of the gLite3.2 WMS
- Until the full replacement of the LCG-CE by the CREAM-CE, ALICE will keep using the gLite WMS.
- Thanks to the improvements already implemented in AliEn v2.16, and to the new versions of the service, the WMS is no longer our first source of problems, as it was one year ago. Still, some (rare) instabilities are seen at CERN.
- A good (and redundant) balance of local and powerful WMS instances is defined in LDAP for all sites.
- Sites providing this service must migrate to the gLite3.2 WMS as soon as possible.

Summary
1. SL5 WNs
2. gLite3.2 VOBOX
3. CREAM-CE
- Case by case, we need to know what is preventing the deployment of these services; regional experts are crucial for this task.
General message:
- Site admins should see problems in the local services BEFORE ALICE does. This is not the case at the moment.
- Ensure the availability of the local VOBOX at all times, the correct publication of the queues in the BDII (a minimal query sketch is given at the end), and the good performance of the CREAM-CE and the local WMS. You have plenty of monitoring systems for this.
- In addition, the WMS should not require the level of baby-sitting it needed one year ago.
- Assuming that ALICE will continue in data-taking mode over the Christmas period, it will be worth establishing a list of local experts available during the vacation period.
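As a small illustration of the "publication of the queues in the BDII" point above, the sketch below queries a BDII for the CE queues published for ALICE and their state. It assumes the python-ldap module is available; the BDII host is a placeholder to be replaced by the site or top-level BDII, and the attribute names follow the Glue 1.3 schema.

    #!/usr/bin/env python
    """Sketch: list the CE queues a BDII publishes for ALICE and their state.
    Requires python-ldap; the BDII host below is a placeholder."""
    import ldap

    BDII_URL = "ldap://your-site-bdii.example.org:2170"   # placeholder site/top BDII
    BASE_DN = "o=grid"

    def published_queues(vo="alice"):
        conn = ldap.initialize(BDII_URL)
        conn.protocol_version = ldap.VERSION3
        conn.simple_bind_s()                               # anonymous bind
        # Glue 1.3 schema: one GlueCE entry per published queue.
        filt = "(&(objectClass=GlueCE)(GlueCEAccessControlBaseRule=VO:%s))" % vo
        attrs = ["GlueCEUniqueID", "GlueCEStateStatus"]
        for dn, entry in conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, filt, attrs):
            # python-ldap returns attribute values as lists (of byte strings on Python 3).
            queue = entry.get("GlueCEUniqueID", [b"?"])[0].decode()
            status = entry.get("GlueCEStateStatus", [b"?"])[0].decode()
            print("%s  %s" % (queue, status))

    if __name__ == "__main__":
        published_queues()

Queues that are missing, or not reported in Production state, are exactly the kind of problem a site admin can spot this way before ALICE notices that jobs stop arriving.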