TSA1.4 Infrastructure for Grid Management
Tiziana Ferrari, EGI.eu
EGI-InSPIRE – SA1 Kickoff Meeting

Goal
• The purpose of this task is to deploy the infrastructure for Grid management: the set of services and tools that the NGI/EIRO Operations Centres need, regionally and/or centrally, to run the Grid software services, to monitor the Grid (including SLA and security monitoring), and to carry out ongoing Grid management.

Internal O-N and O-E tasks
• O-E-1 GOCDB 0.5 FTE, UK
• O-E-3 Monitoring infrastructure 0.25 CERN, 0.25 GRNET
• O-E-4 Operations portal and dashboard 0.25 FTE FR
• O-E-12 Tools for network troubleshooting and monitoring 0.25 FTE IT
• O-N-1 Grid topology database
• O-N-3 Grid repositories (for operational tools)
• O-N-4 Operations portal and dashboard

O-E-1 GOCDB deployment: Current situation
[Diagram. Labels: Read only; Read/Write; EGI tools; central tools; central users; Local users; REGION / NGI; GOCDBPI_v4; GUI; WS; GOCDBPI; WS; GUI; GOCDB module; CENTRAL GOCDB4; GOCDB3]
Courtesy of G.Mathieu

GOCDB deployment: Wanted situation
[Diagram. Labels: Read only; Read/Write; EGI tools; central tools; central users; Local users; REGION / NGI; GOCDBPI_v4; GUI; WS; WS; GUI; GOCDB module; GOCDB module; CENTRAL GOCDB4; INPUT]

GOCDB4 Release timeline
• First half of July, if well planned and well announced; the accounting portal still relies on GOCDB3

O-E-3 Monitoring
• Validation of Nagios instances
  – Nagios migrated on May 26th:
    • ROC: ITALY, UKI
    • NGI: NGI_Greece
  – Nagios migrated on June 1st:
    • ROC Central Europe
    • ROC IGALC
    • ROC Latin America
    • ROC South Western Europe
• Remaining instances will be migrated during June:
  – ROC: AP, Canada, France, Germany/Switzerland, NE, Russia, SEE
  – NGI: NGI_PL, NGI_France, NGI_BY, NGI_SK, NGI_SI, NGI_HR, NGI_CZ (currently running on CERN Nagios instances)
Courtesy of J.Casey, D.Collados

O-E-3 Monitoring (cont)
• Nagios-based availability/reliability reports compared to SAM reports
  – Statistics comparable (small improvement with Nagios by its design)
• SAM
  – Proposed date for switching off: June 15
• MyEGI portal deployment model:
  – central project instance (CERN) + NGI instances
• Monitoring of monitoring
  – https://ops-monitor.cern.ch/nagios/
  – Requested feedback and ideas for more services/probes to deploy (got some input from the ENOC)

O-E-3 Monitoring: Central DBs status
• Central Oracle DBs currently deployed at CERN:
  – Aggregated Topology Provider (ATP)
  – Metric Description Database (MDDB)
  – Metric Results Store (MRS)
• Evolution during Y1:
  – Improve profiles management in MDDB
  – Implement history functionality in ATP
  – Integrate & deploy the three DBs into one single account
  – Maintenance & bug fixing

O-E-3 Monitoring: Messaging
• Currently: 3 sites with brokers + 1 broker for APEL accounting
• Y1 evolution:
  – It was an aim of the general broker network to support authorization as required by APEL
  – APEL to migrate once that has been achieved
  – Until then, one or more brokers will be run to support APEL, depending on STFC's view of the risks of a single point of failure

O-E-4 Operations portal and dashboard
• 2 central web applications:
  – historical portal: http://cic.gridops.org
  – recent portal: http://operations-portal.in2p3.fr
    • hosting the Operations Dashboard Module
• This module will be proposed in a regional package: June 8th
• Other features will be migrated progressively to the new portal and integrated step by step in the regional package
Courtesy of C. L’Orphelin

O-E-4 Central Instance of the dashboard: Architecture

O-E-4 Availability and failover
• High availability context:
  – Each configuration of Lavoisier is copied in SVN
  – The MySQL database is backed up
    • Restoration of the back-up: 30 min
  – The Web machine is hosted in a cluster
• No automatic failover yet.
• The DNS switch and the replication of data will be studied during the 1st year.
• The central instance could be used in case of trouble on the regional instances.
O-E-4: Migration plans
• Migration of the remaining key features to Symfony and the new portal:
  – VO ID Card
  – Broadcast tool
  – User tracking
  – VO / Sites resources browser
• Propose regional modules of these features when possible

O-E-12 Network tools: DownCollector
• Polling tool reporting on reachability of GOCDB services (tests on TCP ports)
• Central server running the probes, star-based architecture
• EGEE III instance https://ccenoc.in2p3.fr/DownCollector/ migrated to GARR (Italy): https://perfsonarlitetss.dir.garr.it/DownCollector/
  – will be accessible through a new portal dedicated to the O-E-12 task, to be set up at http://eginet.garr.it
• High availability currently not available (to be defined in Y1)
• Originally developed by IN2P3 CC-Lyon (EGEE SA2)
Courtesy of M.Reale, GARR

O-E-12 Network troubleshooting
• perfSONAR-lite TroubleShooting Services
• Started in EGEE-III, entirely designed by SA2
• Developments led by DFN/Erlangen
• Central server orchestrating on-demand e2e measurements between light probes hosted by Grid sites
  • Bandwidth measurements
  • DNS lookup
  • Traceroute
  • Port testing
  • Ping

O-E-12 perfSONAR-lite TSS
http://www.dfn.de/en/enhome/x-win/download-of-perfsonar-lite-tss/

O-E-12 perfSONAR-lite TSS: future
– initial deployment strategy within the EGI required
  • O-E-12 testing and deployment campaigns in the next weeks
– core development needed to further improve security related to available bandwidth tests and to simplify AA
– DFN and CNRS are interested in being engaged in the future development

Y1 Milestones and deliverables
• MS401 Operational Tools regionalisation status (INFN) PM1, in collaboration with TSA1.5
• Contribution to MSA406 “Deployment plan for the distribution of operational tools to the NGIs/EIROs” (see TSA1.3)
• Contribution to MSA404 “Operational Level Agreements (OLAs)” (see TSA1.8)

Short/medium term issues
• Migration to the final Nagios server layout, upgrade of the dashboard and gstat, phasing out of GOCDB3
• Is current failover/HA of central operational tools sufficient?
• Measurement of availability/reliability of tools (central/regional MyEGI portals, dashboard, GGUS, regional helpdesk, central/regional monitoring infrastructure, ...)
• Contribution to the definition of OLAs concerning tools
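
Measuring the availability/reliability of the tools listed above could be sketched as follows. This is only an illustration, not the algorithm used by SAM, Nagios or MyEGI: the slides do not define the metrics, so the sketch assumes the commonly used definitions (availability = time up / time with known status; reliability = time up / time with known status minus scheduled downtime). The function name, the status labels and the sample data are hypothetical.

```python
from collections import Counter


def availability_reliability(statuses):
    """Return (availability, reliability) as fractions, or None when undefined.

    statuses: one entry per equal-length monitoring interval, each one of
    "UP", "DOWN", "SCHEDULED" (scheduled downtime) or "UNKNOWN".
    These labels are assumptions made for this sketch.
    """
    counts = Counter(statuses)
    # Intervals with an unknown status are excluded from both metrics.
    known = len(statuses) - counts["UNKNOWN"]
    # Reliability additionally excludes announced (scheduled) downtime.
    expected_up = known - counts["SCHEDULED"]
    availability = counts["UP"] / known if known else None
    reliability = counts["UP"] / expected_up if expected_up else None
    return availability, reliability


# Hypothetical day of hourly results for one tool (e.g. a regional portal).
day = ["UP"] * 20 + ["DOWN"] * 2 + ["SCHEDULED"] + ["UNKNOWN"]
availability, reliability = availability_reliability(day)
print(f"availability={availability:.3f} reliability={reliability:.3f}")
```

Under these assumed definitions scheduled downtime lowers availability but not reliability, which is one reason an OLA for a tool may need to state targets for both figures.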