TSA1.4
Infrastructure for Grid
Management
Tiziana Ferrari, EGI.eu
EGI-InSPIRE – SA1 Kickoff Meeting
Goal
• The purpose of this task is the deployment
of the infrastructure for Grid management
consisting of a set of services and tools
needed by the NGI/EIRO Operations
Centres regionally and/or centrally for the
running of the Grid software services, for
Grid monitoring (including SLA and
security monitoring), and ongoing Grid
management.
Internal O-N and O-E tasks
• O-E-1 GOCDB: 0.5 FTE, UK
• O-E-3 Monitoring infrastructure: 0.25 FTE CERN, 0.25 FTE GRNET
• O-E-4 Operations portal and dashboard: 0.25 FTE, FR
• O-E-12 Tools for network troubleshooting and monitoring: 0.25 FTE, IT
• O-N-1 Grid topology database
• O-N-3 Grid repositories (for operational tools)
• O-N-4 operations portal and dashboard
O-E-1 GOCDB deployment: Current situation
[Diagram. Labels: Read only, Read/Write; EGI tools, central tools, central users, local users; REGION / NGI; GOCDBPI_v4 (GUI, WS); GOCDBPI (WS, GUI); GOCDB module; CENTRAL GOCDB4; GOCDB3]
Courtesy of G.Mathieu
GOCDB deployment: Wanted situation
[Diagram. Labels: Read only, Read/Write; EGI tools, central tools, central users, local users; REGION / NGI; GOCDBPI_v4 (GUI, WS); WS, GUI; GOCDB module (two instances); CENTRAL GOCDB4; INPUT GOCDB4]
Release timeline
First half of July, if well planned and well announced; accounting portal still
relying on GOCDB3
O-E-3 Monitoring
• Validation of Nagios instances
– Nagios migrated on May 26th:
• ROC: ITALY, UKI
• NGI: NGI_Greece
– Nagios migrated on June 1st:
• ROC Central Europe
• ROC IGALC
• ROC Latin America
• ROC South Western Europe
• Remaining instances will be migrated during June:
– ROC: AP, Canada, France, Germany/Switzerland, NE, Russia, SEE
– NGI: NGI_PL, NGI_France, NGI_BY, NGI_SK, NGI_SI, NGI_HR, NGI_CZ
(for now running on CERN Nagios instances)
Courtesy of J.Casey, D.Collados
O-E-3 Monitoring (cont)
• Nagios-based availability/reliability reports compared to
SAM reports
– Statistics comparable (small improvement with Nagios by its
design)
• SAM
– Proposed date for switching off: June 15
• MyEGI portal deployment model:
– central project instance (CERN) + NGI instances
• Monitoring of monitoring
– https://ops-monitor.cern.ch/nagios/
– Requested feedback and ideas for more services/probes to
deploy (got some input from the ENOC)
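The Nagios versus SAM comparison above rests on availability and reliability figures. The slides do not state the formulas, so the sketch below assumes the commonly used definitions: availability is time up over known time, and reliability additionally excludes scheduled downtime. The names `StatusHours`, `availability` and `reliability` are hypothetical and are not taken from SAM or Nagios.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusHours:
    """Hours a service spent in each state over a reporting period (hypothetical type)."""
    up: float
    down: float
    scheduled_downtime: float
    unknown: float

    @property
    def total(self) -> float:
        return self.up + self.down + self.scheduled_downtime + self.unknown


def availability(h: StatusHours) -> Optional[float]:
    """Assumed definition: up / (total - unknown). None if no known time."""
    known = h.total - h.unknown
    return h.up / known if known > 0 else None


def reliability(h: StatusHours) -> Optional[float]:
    """Assumed definition: up / (total - scheduled_downtime - unknown)."""
    denom = h.total - h.scheduled_downtime - h.unknown
    return h.up / denom if denom > 0 else None
```

For example, a 720-hour month with 600 hours up, 48 hours down and 72 hours of scheduled downtime gives an availability of about 0.833 and a reliability of about 0.926 under these assumed definitions. Running the same calculation on status data from both systems is one way to check that the statistics are comparable.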
O-E-3 Monitoring: Central DBs status
• Central Oracle DBs currently deployed at CERN:
– Aggregated Topology Provider (ATP)
– Metric Description Database (MDDB)
– Metric Results Store (MRS)
• Evolution during Y1:
– Improve profiles management in MDDB
– Implement history functionality in ATP
– Integrate & deploy the three DBs into one single account
– Maintenance & bug fixing
O-E-3 Monitoring: Messaging
• Currently: 3 sites with brokers + 1 broker
for APEL accounting
• Y1 evolution:
– supporting authorization as required by APEL has been
an aim of the general broker network
– APEL will migrate once that has been achieved
– until then APEL will run one or more dedicated
brokers, depending on STFC's view of
the risk of a single point of failure
O-E-4 Operations portal
and dashboard
• 2 central web applications:
– historical portal: http://cic.gridops.org
– recent portal: http://operations-portal.in2p3.fr
• hosting the Operations Dashboard Module
• This module will be offered in a regional package on June 8th
• Other features will be migrated progressively
to the new portal and integrated step by step
in the regional package
Courtesy of C. L’Orphelin
O-E-4 Central Instance of the
dashboard: Architecture
O-E-4 Availability and failover
• High availability context:
– Each Lavoisier configuration is copied to SVN
– The MySQL database is backed up
• Restoration of the backup: 30 min
– The Web machine is hosted in a cluster
• No automatic failover yet.
• The DNS switch and the replication of data will
be studied during the 1st year.
• The central instance could be used in case of
trouble on the regional instances.
O-E-4: Migration plans
• Migration of the remaining key features to Symfony
and the new portal:
– VO ID Card
– Broadcast tool
– User tracking
– VO / Sites resources browser
• Propose regional modules for these features
where possible
O-E-12 Network tools
DownCollector
• Polling tool reporting on reachability of GOCDB services (tests on TCP
ports)
• Central server running the probes, star-based architecture
• EGEE III instance: https://ccenoc.in2p3.fr/DownCollector/ migrated to
GARR (Italy): https://perfsonarlitetss.dir.garr.it/DownCollector/
– will be accessible through a new portal dedicated to the O-E-12 task, to be
set up at http://eginet.garr.it
• High availability currently not provided (to be defined in Y1)
• Originally developed by IN2P3 CC-Lyon (EGEE SA2) → GARR
Courtesy of M.Reale
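The kind of TCP port test DownCollector performs can be sketched as follows. This is a minimal illustration and not the DownCollector code: the function names `tcp_port_reachable` and `poll_services` are hypothetical, and the list of hosts and ports would in practice come from GOCDB.

```python
import socket
from typing import Dict, List, Tuple


def tcp_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def poll_services(
    services: List[Tuple[str, int]], timeout: float = 3.0
) -> Dict[Tuple[str, int], bool]:
    """Probe every (host, port) pair from one central server (star topology)."""
    return {
        (host, port): tcp_port_reachable(host, port, timeout)
        for host, port in services
    }
```

Because all probes start from a single central server, a failed test only shows that the service is unreachable from that server. It does not show that the service is down for every client.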
O-E-12 Network troubleshooting
• perfSONAR-lite TroubleShooting Services
• Started in EGEE-III, entirely designed by SA2
• Development led by DFN/Erlangen
• Central server orchestrating on-demand e2e measurements
between light probes hosted by Grid sites
• Bandwidth measurements
• DNS lookup
• Traceroute
• Port testing
• Ping
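To illustrate how a light probe might dispatch an on-demand measurement by type, here is a small sketch using only the Python standard library. It implements the DNS lookup alone; the registry `MEASUREMENTS` and the function names are hypothetical and do not reflect the perfSONAR-lite TSS implementation or its protocol.

```python
import socket
import time
from typing import Callable, Dict


def dns_lookup(hostname: str) -> dict:
    """Resolve hostname and report the addresses found and the elapsed time."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(hostname, None)
        addresses = sorted({info[4][0] for info in infos})
        error = None
    except socket.gaierror as exc:
        addresses, error = [], str(exc)
    return {
        "measurement": "dns_lookup",
        "target": hostname,
        "addresses": addresses,
        "error": error,
        "elapsed_s": time.monotonic() - start,
    }


# Hypothetical registry: other measurement types (ping, traceroute, ...)
# would be added here as further callables.
MEASUREMENTS: Dict[str, Callable[[str], dict]] = {"dns_lookup": dns_lookup}


def run_measurement(kind: str, target: str) -> dict:
    """Run the requested measurement type against target."""
    if kind not in MEASUREMENTS:
        raise ValueError(f"unsupported measurement: {kind}")
    return MEASUREMENTS[kind](target)
```

A central server would send a request such as `run_measurement("dns_lookup", "localhost")` to a probe and collect the returned dictionary.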
O-E-12 perfSONAR-lite TSS
http://www.dfn.de/en/enhome/x-win/download-of-perfsonar-lite-tss/
O-E-12 perfSONAR-lite TSS: future
– initial deployment strategy within EGI required
• O-E-12 testing and deployment campaigns in the
next weeks
– core development needed to further improve
security related to available-bandwidth tests and
simple AA
– DFN and CNRS are interested in being engaged
in future development
Y1 Milestones and deliverables
• MS401 Operational Tools regionalisation
status (INFN) PM1 in collaboration with
TSA1.5
• Contribution to MSA406 “Deployment plan
for the distribution of operational tools to
the NGIs/EIROs” (see TSA1.3)
• Contribution to MSA404 “Operational
Level Agreements (OLAs)” (see TSA1.8)
Short/medium term issues
• Migration to the final Nagios server layout, upgrade
of the dashboard and gstat, phasing out of
GOCDB3
• Is current failover/HA of central operational tools
sufficient?
• Measurement of availability/reliability of tools
(central/regional MyEGI portals, dashboard,
GGUS, regional helpdesk, central/regional
monitoring infrastructure,...)
• Contribution to the definition of OLAs concerning
tools