Status of the ALICE Grid sites
ALICE Offline Week
CERN, 29 October 2009
Patricia Méndez Lorenzo
(CERN, IT/GS-EIS)
Outlook

ALICE is already in data-taking mode. This implies:

- Maximal automation and control of the local services by the site admins is needed
- Follow-up of the (local) WLCG services cannot be under the control of the central Grid team at CERN
- 101 VOBOXes distributed over 83 sites require the active involvement of site admins and regional experts

We need a clear statement from each site in terms of:

1. Available resources and setup
2. Available workload management system
3. Management of local VOBOXes and backups
4. Recovery procedures and protocols

These points are the topic of this talk.
WNs in SL(C)5

Requirement:
- Deployment of SL5 on all WNs (T0, T1 and T2 sites)
- DEADLINE: mid-September 2009

Service status:
- The service was deployed in production by 23 March 2009
- On 4 August 2009, the WLCG MB specifically promoted the deployment of SL5 WNs at all sites
- ALICE presentation during the September WLCG GDB meeting: the experiment has been asking for the deployment of this service at all sites since summer 2009

ALICE sites status:
- We are very far from the expectations presented in September
Status of the sites we know about (I)

(Deployment of SL5 and CREAM-CE at the sites are related)

- CERN (T0): SLC4 and SLC5; ALICE is currently using SLC5 resources only
- RAL (T1): used together with CERN to test the ALICE software on SL5; mostly migrated to SL5
- CNAF (T1): testing of SL5 ongoing; most of the resources for LHC are still on SL4 (mixed situation)
- CCIN2P3 (T1): site configuration upgraded yesterday to increase the number of available SL5 resources
- FZK (T1): migration to SL5 finished
Status of the sites we know about (II)

- Madrid (T2): migrated to SL5
- Trujillo (T2): migrated to SL5
- Hiroshima (T2): beginning the migration of the whole site to SL5
- KISTI (T2): mixed situation, both types of WNs available
- Subatech (T2): mixed situation, both types of WNs available
Status of the sites we know about (III)

- ITEP (T2): migrated to SL5
- SPbSU (T2): migrated to SL5
- Kolkata (T2): mixed situation, SL4 and SL5 available
- IPNL (T2): migrated to SL5
- IHEP (T2): migrated to SL5
Status of the sites we know about (IV)

- RRC-KI (T2): migrated to SL5
- GRIF_DAPNIA (T2): migrated to SL5
- JINR (T2): migrated to SL5
- Troitsk (T2): migrated to SL5
The gLite3.2 VOBOX
Evolution of the service

- "Pre-PPS" version announced at the beginning of October
  - Patch #3205 (SL5 VOBOX) and #3040 (WMS UI fixes) installed at CERN for testing purposes
  - Instructions provided by IT-GD and distributed through the ALICE TF to the sites, specifying that it is a testing patch to gain familiarity with the system and its installation
- New rpm issues stopped the deployment of the patch for almost 4 weeks
  - ALICE sites have been kept informed of every advance and issue
- The patch has been in production since 13 October 2009
  - Two more sites had already migrated the VOBOX: KISTI and ITEP
  - However, only 12 sites are providing this new version today
The gLite3.2 VOBOX: Issues

- Thanks to the availability of the gLite3.2 VOBOXes, we were able to find a compatibility problem associated with the libs
  - The libs needed by the new VOBOX for its normal operations (proxy renewal, etc.) conflicted with those provided by AliEn
- Status: SOLVED
  - Changes in the environment setup before and after any VOBOX-specific operation
- Further issues: NONE
  - The current gLite3.2 VOBOX should not show any further problem for the experiment or the site admins
  - Very easy installation and configuration
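The lib-conflict workaround above (changing the environment setup around any VOBOX-specific operation) can be sketched as follows. This is only an illustration of the approach, not the actual fix shipped to sites; the `/opt/alien` prefix and the function names are assumptions.

```python
import os
import subprocess


def strip_alien_paths(ld_library_path, alien_prefix="/opt/alien"):
    """Remove AliEn-provided lib dirs from an LD_LIBRARY_PATH string, so
    a VOBOX tool (e.g. proxy renewal) picks up the gLite libs instead.
    The /opt/alien prefix is an illustrative assumption."""
    return ":".join(p for p in ld_library_path.split(":")
                    if p and not p.startswith(alien_prefix))


def run_vobox_op(cmd, alien_prefix="/opt/alien"):
    """Run a VOBOX-specific command with a cleaned copy of the
    environment; the caller's environment is left untouched, mimicking
    the 'change the setup before and after the operation' idea."""
    env = os.environ.copy()
    env["LD_LIBRARY_PATH"] = strip_alien_paths(
        env.get("LD_LIBRARY_PATH", ""), alien_prefix)
    return subprocess.run(cmd, env=env)
```

Because the cleaned environment only exists in the child process, the AliEn services running on the same VOBOX are unaffected.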
Status of the SL5 VOBOX at the sites (I)

- CERN (T0): in production for more than one month
- CNAF (T1): available in production
- FZK (T1): registration of the new VOBOX DN required
- RAL (T1): HW already in place, installation ongoing
Status of the SL5 VOBOX at the sites (II)

- Subatech (T2): using the PPS installation
- POZNAN (T2), ITEP (T2), Troitsk (T2), KISTI (T2), SPbSU (T2), Kolkata (T2): deployed
- JINR (T2): registration of the new DN required
- Hiroshima (T2): installation ongoing
Hybrid situation

- Hybrid (worse) situation: WNs in SL4/32b and SL5/64b
  - This is the situation ALICE wanted to avoid in September
- The experiment can manage this situation if and only if:
  - 2 VOBOXes are provided (each one will run an independent PackMan service)
    - and this is independent of CREAM!
  - 2 different software areas (per VOBOX, per cluster) are provided
    - The software version changes with the architecture
    - The support infrastructure has to be doubled
Control and management of the VOBOXes
The VOBOX-T0 project

- The VOBOX-T0 project started before summer 2009 for the 4 LHC experiments
- GOALS:
  - Track the specific experiment services running on all T0 VOBOXes
  - Provide the operators with a standard protocol to recover those services in case of problems
    - The monitoring system used by the operators at CERN is LEMON
    - Although VOBOXes do not have piquets, we can ensure a 1st level of support with easy procedures
- The 1st complete prototype is now ready for ALICE and available at voalice06@CERN
Application to the ALICE VOBOXes

- A test suite has been created, based on some critical tests executed in our SAM test suite
- The test suite runs with root privileges via a cron job
- The test suite includes:
  - Status of the proxy-renewal daemon
  - Proper registration of the VOBOX in the MyProxy server
  - Status of the gsisshd daemon
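A minimal sketch of the daemon-status part of such a cron test suite, assuming a Linux VOBOX: it scans /proc for the process names. The daemon names used below are assumptions for illustration, not the suite's actual check list, and the real tests (MyProxy registration, proxy lifetime) are not reproduced here.

```python
import os


def daemon_running(name):
    """Check whether a process with the given command name is alive by
    scanning /proc (Linux-only sketch)."""
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                if f.read().strip() == name:
                    return True
        except OSError:  # process exited while we were scanning
            continue
    return False


def run_checks(daemons=("glite-proxy-renewd", "gsisshd")):
    """Map each daemon to its status; any False entry is what would
    trigger a LEMON alarm on the VOBOX (daemon names are assumptions)."""
    return {d: daemon_running(d) for d in daemons}
```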
Procedures for the ALICE VOBOXes

- If one of the previous tests fails, the LEMON system locally installed on the VOBOX will trigger a specific alarm
- This alarm will be visible to the operator at the Computer Centre
- We have created a list of recovery procedures and actions, linked from the page used by the operator
- A similar procedure will be applied to all experiments
- Advantages:
  - We will control on a 24/7 basis the good behaviour of the gLite VOBOX service at CERN
  - In case of problems the system can be recovered transparently to the experiment
  - A list of experts is also available for the operator
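The test-fails-then-alarm step above can be sketched as a small function; LEMON integration is replaced here by a plain callback, and the message format is illustrative, not the format the operator actually sees.

```python
def evaluate_checks(results, alarm=print):
    """Given a dict of check name -> bool (as the cron test suite might
    produce), emit one alarm per failed check and return the failures.
    The `alarm` callback stands in for the local LEMON sensor."""
    failed = [name for name, ok in results.items() if not ok]
    for name in failed:
        alarm(f"VOBOX ALARM: check '{name}' failed, see recovery procedures")
    return failed
```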
Control of VOBOXes at the sites

- The project is intended for the T0, but the idea is to export it in the future to the T1 sites
  - Any Tx site interested in similar procedures can contact us to adapt them to its specific setup
- Similar recovery procedures (or any at all) should exist at every site
  - Considering the level of support agreed at the site
  - Site admins should react before being notified by ALICE
  - Remember: VOBOXes are critical services for ALICE
CREAM-CE
Status of the CREAM-CE

Requirement:
- ALICE presentation during the September WLCG GDB meeting:
  - Deployment of the CREAM-CE (T0, T1 and T2 sites) in parallel to the LCG-CE
  - DEADLINE: mid-November 2009
- In any case, the parallel deployment of the CREAM-CE alongside the LCG-CE was already requested by the November 2008 GDB

Service status:
- CREAM1.5 has been deployed in production since 6 October 2009
- Last WLCG MB (27 October 2009): WLCG sites should now deploy a CREAM-CE service in parallel with the production LCG-CEs in order to gain real production experience
  - Although the ability to submit from Condor clients is still missing, the site installation will not change when this becomes available. The CREAM-CEs should also now be marked as "production" in the information system

ALICE sites status:
- Very far from the expectations presented in September
Status of the CREAM-CE at the sites

- Sites providing a CREAM-CE to ALICE production:
  - CERN, KISTI, INFN-Torino, CNAF, RAL, FZK, Kolkata, IHEP, SARA, Subatech, Legnaro, SPbSU, Prague
  - Troitsk: installation done and in testing phase
- Same situation as presented in September 2009
- At the mentioned sites ALICE is running in stable production mode
- Situation of CC-IN2P3:
  - CREAM integration with BQS (the local batch system) done
  - ALICE local support is testing the system
  - Next step: deployment of the system in production for ALICE testing
  - We are waiting for the green light from the site admins
CREAM-CE issues in detail (I)

- CREAM1.5 was deployed in production on 6 October (patch #3259 for SLC4/i386)
  - Patch details: https://savannah.cern.ch/patch/?func=detailitem&item_id=3259#options
- Important bug fixes for the ALICE production are included in this new version
  - It also contains fixes for two vulnerability reports associated with security issues for the sites
- Migration of sites to CREAM1.5 is highly encouraged, and ALICE fully supports it for all sites providing this service for the experiment
CREAM-CE issues in detail (II)

- Purge issues:
  - ALICE REPORT: wrong reporting of job status; CREAM's view of the running jobs gets de-synchronized
  - ALICE REQUIREMENT: a method to purge jobs in a non-terminal status
  - CREAM STATUS:
    - CREAM job status can be wrongly reported because of some misconfigurations or because of two bugs in the BLAH BLParser (solution status: integration), candidates for CREAM1.6
    - There is a specific bug which covers the ALICE requirement:
      - #55420: "Allow admin to purge CREAM jobs in a non terminal status" (solution status: in progress)
    - IMPORTANT NOTE: the script provided to the ALICE site admins is indeed the bug fix. As soon as the relevant patch is released, the script will be part of the CREAM rpm
  - CURRENT RISK FOR ALICE: low, once the developers provided site admins with the corresponding purge script (very high before)
CREAM-CE issues in detail (III)

- DISK SPACE issues (reported by Subatech): areas to monitor and purge or clean
  - ALICE REPORT: the local mysql DB grew up to 2.5 GB
    - CREAM STATUS: issue associated with the mysql engine. When entries are deleted from the DB, the corresponding disk space is not released (therefore the CREAM DB does not shrink), but the space is reused when new data is added to the DB
    - CURRENT RISK FOR ALICE: low
  - ALICE REPORT: purge of the input sandboxes in /opt/glite/var/cream_sandbox
    - CREAM STATUS: solved in CREAM1.5
    - RISK FOR ALICE: none, once sites upgrade to CREAM1.5
CREAM-CE issues in detail (IV)

- DISK SPACE issues (cont.)
  - ALICE REPORT: issues regarding /opt/glite/var/log and /var/log
    - ALICE REQUIREMENT: a cleaning policy is required for these files, otherwise they can grow forever
    - CREAM STATUS: policies exist for all these files and can be customized file by file
    - RISK FOR ALICE: low, since the size is manageable by site admins
  - ALICE REPORT: issues regarding /opt/glite/var/cream/user_proxy
    - CREAM STATUS: bug reported and accepted; the fix is not available in CREAM1.5
    - RISK FOR ALICE: waiting for site admins' feedback
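An age-based cleaning policy like the one required above can be sketched in a few lines. This is only an illustration of the idea: the 30-day threshold is an assumption, and it is not CREAM's own mechanism (CREAM's policies are customizable file by file, as noted above).

```python
import os
import time


def prune_old_logs(log_dir, max_age_days=30):
    """Illustrative cleaning policy: delete plain files in log_dir whose
    modification time is older than max_age_days, so a log directory
    cannot grow forever. Returns the names of the removed files."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```

In practice a site admin would run something equivalent from cron (or use logrotate) per directory, tuning the retention per file as the CREAM policies allow.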
CREAM-CE issues in detail (V)

- LOAD issues (reported by Subatech):
  - ALICE REPORT: UNIX load going up to 5 (during startup or at high submission rates)
    - CREAM STATUS: problem reported by GRNET; the origin was a missing index in the CREAM DB, solved in CREAM1.5
    - RISK FOR ALICE: low, once the CREAM version is upgraded
  - ALICE REPORT: when tomcat is restarted, the system can take up to 15 min before submitting new jobs
    - CREAM STATUS: the slow start of CREAM is also due to the problems with jobs reported in a wrong status; the fix is not included in CREAM1.5 but will be released in CREAM1.6
    - RISK FOR ALICE: purge actions should speed up this startup and therefore decrease the risk for the experiment
CREAM-CE: conclusions

All reported issues are:
- solved in CREAM1.5, OR
- known by the developers, who are working on solutions to be included in CREAM1.6, OR
- worked around by the developers for ALICE

In addition, the support of the CREAM-CE developers, and in particular of Massimo Sgaravatto, is exceptional.

Finally, the WLCG MB is clearly promoting the deployment of this service at all sites.

SO THE QUESTION IS CLEAR:
Why are sites not providing a CREAM-CE?
gLite3.2 WMS
Status of the gLite3.2 WMS

- Until the full replacement of the LCG-CE by the CREAM-CE, ALICE will keep using the gLite WMS
- Thanks to the improvements already implemented in AliEn v2.16, and to the new versions of the service, the WMS is no longer our 1st source of problems
  - As it was one year ago
  - Still some (rare) instabilities seen at CERN
  - A good (and redundant) balance of local and powerful WMS is defined in LDAP at all sites
- Sites providing this service must migrate to the gLite3.2 WMS asap
Summary

1. CREAM-CE
2. gLite3.2 VOBOX
3. SL5 WNs

- Case by case, we need to know what is preventing the deployment of these services
- Regional experts are crucial for this task

General message:
- REMEMBER: site admins should see problems at the local services BEFORE ALICE does
  - This is not the case at the current moment
- Ensure the availability of the local VOBOX at any moment, the proper publication of the queues in the BDII, and the good performance of the CREAM-CE and the local WMS
  - You have plenty of monitoring systems for this
- In addition:
  - The WMS should not require the level of baby-sitting we had one year ago
  - Assuming that ALICE will continue in data-taking mode over Christmas, it will be worthwhile to establish a list of local experts available during the vacation period