Proposal for COD and CE ROC 1st line support cooperation

Admin Matters
Enabling Grids for E-sciencE
• Vera Hanser – NDGF
• Jan Astalos – IISAS
• COD dinner downtown on Thursday night: fill in the attendance sheet if interested
• COD-15: Lyon, 06-08 Feb 2008
EGEE-II INFSO-RI-031688
ROC managers meeting at EGEE 2007 conference, Budapest, October 1, 2007
COD Working groups leaders
Phone conference of COD topic leaders: Jan 11th – TBC
Update wiki
Find deputies
Straightforward mandate working groups:
• GSTAT – TW
• SAM – CERN
• SAMAP – CE
- Improvement of work tools – CE
- Improvement of work practices – DE-CH/FR
- Release of updated documentation – SE/
- Integration of the existing tools – FR
- Set-up of High Availability strategy of the operational tools for CODs – IT
NEW, NEW, NEW:
- Set-up of Failover Mechanisms for Grid Core Services across federations, e.g. VOMS – SWE
Proposal for COD and CE ROC
1st line support cooperation
Jan Astalos, Marcin Radecki
CE ROC
www.eu-egee.org
EGEE and gLite are registered trademarks
Rationale
• The rationale on this topic can be found at the following URL:
http://goc.grid.sinica.edu.tw/gocwiki/TIC_1st_line_support_integration
Background: In the CE region, a team of technical grid experts
works on an 8/5 basis to help CE site admins solve any
problem with their grid site. The experts assist the site admin from
the origin of the problem to its solution by actively searching for the
solution, i.e. doing detailed diagnosis at the remote site, writing
the necessary scripts, etc.
In some cases 1st line support was notified by the site admin of
a problem and had already found a solution, yet, because of
monitoring system latency, the site was still assigned a ticket
by COD. This prompted us to consider how the COD team
could benefit from the 1st line support team in the region.
Integration of regional support
• In daily operations, this would take the form of a specific
dashboard set up in the ROC section, where CE 1st line support would
handle SAM alarms for CE sites during their 1st day of occurrence.
Alarms still open after that would be handled as usual by the
regular COD teams.
• The mechanism is meant to be transparent for the COD activity.
The CE federation would still be part of the regular COD teams, so
no specific adjustments would be needed.
Finally, discussions on modifying the tool in the CIC
operations portal would be set up for Jan 1st 2008.
Conclusions and analysis could be drawn at the end of EGEE-II.
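As an illustration only, the routing rule described above could be sketched as follows. The region label, the function name and the reading of "1st day" as 24 hours are assumptions made for this sketch; they do not come from the CIC operations portal.

```python
from datetime import datetime, timedelta

# Assumption: "1st day of occurrence" is read as the first 24 hours.
FIRST_DAY = timedelta(days=1)
# Hypothetical label for sites belonging to the CE federation.
CE_REGION = "CentralEurope"


def handler_for(alarm_region: str, raised_at: datetime, now: datetime) -> str:
    """Return the team that should look at an open SAM alarm."""
    if alarm_region == CE_REGION and now - raised_at < FIRST_DAY:
        # Young alarms on CE sites go to the regional 1st line support.
        return "CE 1st line support"
    # Everything else is handled as usual by the regular COD teams.
    return "COD"
```

Alarms from other federations are never diverted, which is what keeps the mechanism transparent for the regular COD activity.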
1st line support in CE ROC
• On-duty shifts covering working hours
– IISAS (4 days) and PSNC (1 day)
• Problem detection
– SAM, Gstat, Nagios for Central Europe
• Analysis of problems
– Diagnostic jobs/tests, remote analysis of log files, interactive Grid
login tool
• Sending notifications to sites
– Direct e-mail to site contact address, IM, Skype chat
• Assistance to site admins in problem solving
– Interactive support usually via IM or e-mail exchange
• SAMAP jobs for checking if the problem is solved
• Daily reports with problem summaries
– To ROC representative + other 1st line supporters
– Issues to be raised at the weekly operations meeting
• Sending GGUS tickets to developers, etc.
Proposed cooperation with COD
• Main goals
– To avoid tickets on problems that are already solved
– To decrease the effort needed for alarm/ticket processing at
COD level and also at site level
• Proposal
– To inform COD about status of problem analysis using alarm
annotation
– To pass results of detailed problem analysis to COD
– To give sites and 1st line support a one-day grace period to fix non-critical problems
   - If site admins do not respond to the notification from 1st line support,
     they will receive a ticket from COD
• Issues
– If alarm annotation is not implemented, we can use site
annotation
– Urgent problems at sites
   - COD can decide to send a ticket immediately
   - 1st line can also use other communication channels to reach site
     admins
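The proposal and issues above can be read as a small decision rule for COD. The sketch below is one possible reading, not an agreed procedure: the `AlarmState` fields and the function name are hypothetical, and it assumes that any alarm still unsolved when the grace period ends (the typical case being a site admin who did not respond) results in a ticket.

```python
from dataclasses import dataclass


@dataclass
class AlarmState:
    solved: bool         # 1st line annotation reports the problem as fixed
    urgent: bool         # COD judges the problem to be urgent
    grace_expired: bool  # the one-day grace period has passed


def cod_should_ticket(state: AlarmState) -> bool:
    """Decide whether COD opens a ticket for this alarm."""
    if state.solved:
        # Main goal: no tickets for problems that are already solved.
        return False
    if state.urgent:
        # Urgent problems: COD can send a ticket immediately.
        return True
    # Non-critical problems: wait for the grace period to end.
    return state.grace_expired
```

The annotation (on the alarm or, failing that, on the site) is what would carry the `solved` information from 1st line support to COD.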