Operating Central European EGEE ROC
Marcin Radecki, Tomasz Szepieniec, Aleksander Kusznir and Marian Bubak
ACC CYFRONET AGH
Enabling Grids for E-sciencE, www.eu-egee.org
EGEE-II INFSO-RI-031688
CGW’06; Cracow; 15-18th October 2006 (title slide dated 17 October 2006)
EGEE and gLite are registered trademarks

Outline
• Introduction
  – EGEE and the Central European (CE) Region
• Challenges for the CE Regional Operating Centre
  – Applications & Users
  – Cooperation
  – Grid Infrastructure
• Conclusions

EGEE – Community
• Possibly the largest production infrastructure; spans over 32 countries
• ca. 200 sites grouped under 11 ROCs
• Scientific community involves over 2000 people
• EGEE’06 conference in Geneva
  – 700 attendees
  – 32 "partner" projects present

VOs
ID        Name       Discipline               Users
EGEE-001  Atlas      Physics                    890
EGEE-002  Alice      Physics                    175
EGEE-003  LHCb       Physics                    159
EGEE-004  CMS        Physics                    632
EGEE-010  ESR        Earth Sciences              42
EGEE-014  Biomed     Biomed                     114
EGEE-039  Comp Chem  Chemistry                   15
EGEE-040  Magic      Astro particle physics      16
EGEE-042  dteam      Infrastructure testing      30
EGEE-065  EGEODE     Geo-Physics                 33
EGEE-066  Planck     Astrophysics                 8
Total                                          2114

Central European Region in EGEE
• 7 countries, 22 sites, 1493 CPUs, 70 TB storage space
• Supports 10 of the 11 EGEE-approved VOs plus many associated VOs
• Site sizes range from 2-3 to 300 CPUs
• Need for solutions suitable for both large computing centres and small sites
  – Maintenance model
  – Skills & experience
  – Scalable across a site’s resources

Challenges for CE ROC
• We need to attract new users to the grid and make their work in the new environment possible, so that the resources are used efficiently. Provide the services the users require.
• The grid spans many administrative domains, each of which needs to cooperate actively in order to share resources and collaborate productively. Excellent opportunity for sharing expertise.
• Having resources is not enough; the infrastructure needs to be stable before real users start to use it, and we should maximize utilization as far as possible.

Grid-enabling users
• Means to gain users and keep them with us
  – Understand users’ needs and satisfy them
  – Easy access, how-to-use documentation (in national languages)
  – Stable working environment
  – User Support infrastructure
• Results:
  – Computational chemistry
    - Mariusz Sterzel (CYFRONET) coordinates computational chemistry applications in EGEE
    - Enabling commercial software: Gaussian VO
    - Study on pyrazoloquinolines (PQ) used for laser light generation
  – Bioinformatics
    - Never Born Protein folding and function recognition: Prof. Irena Roterman’s team (CM-UJ)
  – Others: many small teams are working within the regional catch-all VO, VOCE

VOs in the Region
• Supported VOs list
  alice, atlas, auger, balticgrid, belle, biomed, cms, compass, compchem, crogrid, esr, euchina, gamess, gaussian, geant4, gear, geclipse, hone, hungrid, lhcb, magic, ops, skgrid, voce, vocet, zeus
• Service/Data Challenges and test productions
  – Atlas Service Challenge 4
  – World-wide In Silico Docking On Malaria data challenges: 1st and 2nd (ongoing)
  – EGEE-ITU: international digital broadcasting agreement; new frequency plan compatibility and complementary analysis

Management of CE ROC
• ROC Manager
  – Represents the region at the level of the Project managerial bodies
  – Supervises all Service Activities
• Operations
  – Coordinates actions related to infrastructure and middleware
  – Escalates unsolvable problems one level higher
  – Fits the Project requirements into the region
• User Support
  – Provides support tools for users
  – Takes part in shifts handling all user tickets in the GGUS system
• Security
  – Incident handling procedures
  – Incident response team
[Organization chart. Roles: ROC Manager; User Support Responsible; Operations Responsible; Security Responsible; 1st Line Support; Core Grid Services; Regional Certification of Middleware; Grid Operator On Duty; Pre-Production Service. Institutions shown: CYFRONET, IISAS/PSNC, CESNET/PSNC, ICM WARSAW]

Procedures and Commitments
• Well-defined procedures make collaboration more efficient
  – Clear paths for how we deal with things, to avoid misunderstandings
  – There are always newcomers
  – People tend to forget things over time
• Examples of procedures:
  – New site registration
  – New site admin joining
  – Site problem handling
  – Sending Weekly Reports
• Monitoring commitments makes people more motivated

Operations - coordinate the work
• Operations is the most time-consuming task
  – To make sure that operational procedures are understood and followed properly
  – To ensure production requirements are met at the sites
  – To work out the best solutions for problems
  – To understand expectations/needs
  – To make sure problems are being solved in a proper way
  – To ensure weekly reports are completed and sent
• Three styles of site administration observed
  – Keep all services ready all the time: "I’m the best admin in the city"
  – React only when a problem report arrives: "I’m a bit occupied"
  – React only if my name appears on a publicly available "black list": "I’m hard-working on… something important"

Resources and their usage
• Accounting in EGEE
  – July-October ’06: over 672k CPU hours computed in the CE region; equivalent of 275 CPUs running 24x7
  – Problems with "missing" data
  – Update rate: daily
• Our approach to accounting
  – Site performance efficiency study:
    - Up-to-date information on what is going on at a site
    - Maximize site utilization: better to have jobs queued at a site than idle CPUs
  – Is being extended towards a new system for fine-grained accounting
[Chart: Max. CPUs, Jobs Executing, Jobs Queued; annotation: "Avoid low usage periods"]

Stable infrastructure - social aspect
• How EGEE keeps the Grid stable
  – Grid Operator on Duty (GOD) watches the entire grid; CE joined this activity in the first turn in EGEE-II
  – Raise a ticket for each detected problem
  – Problem diagnosis and solution suggestion
  – Use monitoring tools for problem detection and availability metrics
• 1st Line Support in CE - how to be better than the average?
  – Detect and fix failures before the GOD Team notices them and raises a ticket
  – Support site admins in remedial actions
  – Suggest known, well-working practices: expertise sharing
  – Knowledge comes out of the mind with pain: although sharing it saves a lot of time at work, people need a lot of encouragement to do so

Stable infrastructure - monitoring with NAGIOS
• Try to monitor as much functionality as possible
  – E.g. the certificate expiration dates of all machines
• Reasonable probe frequency
• Send a problem notification immediately but…
  – Do not spam every 5 minutes
• Allow the site admin to say that the problem is being worked on
  – Do not send notifications until notified
• Allow the site admin to schedule an extraordinary check at will
  – So that the admin can see at once how well the workaround is working
• Smart testing hierarchy
• Monitors CE Core Services
  – Added tests for checking RB, BDII, LFC, VOMS
• Used by 1st line support
  – Overview of the region
  – Detailed check of services
  – Schedule checks when working on fixes

Operations metrics results
EGEE Operations metrics results from the last 10 months
[Chart 1: Functional test failure % ratio; y-axis: % of failures; x-axis: Dec 05 to Sep 06; series: EGEE, CE, Best player]
[Chart 2: Time unavailable % ratio; y-axis: % of time; x-axis: Dec 05 to Sep 06; series: EGEE, CE, Best player]
Data from EGEE CIC portal: https://egee.in2p3.fr/CIC/index.php?id=cic&subid=cic_roc_metrics&scope=project&project=&metrics=sft

Conclusions
• CYFRONET gained know-how on:
  – Coordination of a large initiative
  – Organization of work for different subtasks
  – Running a stable production infrastructure
  – Accurate Grid job accounting
  – Sensible and precise Grid infrastructure monitoring
  – Facilitating the introduction of application users to the Grid
• Experience gathered in CE ROC may easily be re-used in building the national Polish grid

PL-Grid: a nationwide Polish grid infrastructure
Team of the Academic Computer Centre CYFRONET AGH
Kraków, June-September 2006
PL-Grid, Warsaw, 22.09.2006

This document presents the motivation, goals, concept and approach for creating a national grid infrastructure, essential for conducting modern scientific research (e-Science) and consistent with the European infrastructure.

PL-Grid as an infrastructure for e-Science
Conducting scientific research today requires advanced information technologies. The number of research teams that collaborate intensively is growing, and this requires IT tools for collecting and exchanging the acquired knowledge on a global scale. Experimental results are huge, distributed data sets of varied structure, and working with them requires tools for data access, integration and processing. Computer simulation is a fully accepted research method, and results obtained from simulations and experiments are increasingly combined. This novel approach is most visible in high-energy physics, astrophysics, the biological and medical sciences, and the Earth sciences. Realizing this new paradigm of scientific research, called e-Science, requires a grid infrastructure (also called Cyber-Science Infrastructure), comprising software that enables the sharing of various computing resources and tools that support the cooperation of partners within so-called virtual organizations.
Fig. 1. PL-Grid as an infrastructure for e-Science

Simplified architecture of PL-Grid
[Diagram, layers from top to bottom]
• Users
• Access / application development layer
  – Grid portals, programming tools
• Grid services
  – Job management
  – Monitoring
  – Data management
  – Virtual organization management
  – Security system
• Basic grid services
  – LCG/gLite (EGEE), UNICORE (DEISA), Globus
• Grid resources
  – Distributed data repositories
  – National computer network
  – Distributed computing resources

Organizational structure of PL-Grid
[Diagram. Bodies: Consortium Board (Coordinator + members); Users' Council; Consortium Council; Domain grids; Operations Centre; PL-Grid Infrastructure (hardware, network). Arrow labels: Information, Proposals, Reports, Recommendations, Coordination, Evaluation]

Work schedule
[Gantt chart; time axis in months: 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36]
• Topics
  – Project preparation and approval
  – Consortium organization
  – Hiring staff
  – Equipment purchases
  – Research and training infrastructure
  – Production infrastructure
  – Software development
  – Grid training
  – Activity reviews
• Phases: test phase, pilot phase, maintenance and development phase
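
A rough cross-check of the accounting figures quoted on the "Resources and their usage" slide can be sketched in a few lines of Python. The three constants are taken from the slides; the function names are illustrative only, and the utilization estimate assumes that the 1493 CPUs reported for the region were all available for the whole period, which the slides do not state. Since the slide says "over 672k", the results should be read as approximate lower bounds.

```python
# Figures quoted on the slides for the CE region (July-October 2006).
CPU_HOURS = 672_000      # "over 672k CPU hours computed"
EQUIVALENT_CPUS = 275    # "equivalent of 275 CPUs running 24x7"
REGION_CPUS = 1493       # CPUs reported for the whole CE region

HOURS_PER_DAY = 24


def implied_period_days(cpu_hours: float, cpus: int) -> float:
    """Days that `cpus` CPUs must run 24x7 to deliver `cpu_hours`."""
    return cpu_hours / (cpus * HOURS_PER_DAY)


def average_utilization(equivalent_cpus: int, total_cpus: int) -> float:
    """Fraction of installed CPUs kept busy on average.

    Assumption (not from the slides): `total_cpus` was constant and
    fully available during the accounting period.
    """
    return equivalent_cpus / total_cpus


if __name__ == "__main__":
    days = implied_period_days(CPU_HOURS, EQUIVALENT_CPUS)
    share = average_utilization(EQUIVALENT_CPUS, REGION_CPUS)
    print(f"Implied accounting period: {days:.1f} days")
    print(f"Average share of regional CPUs in use: {share:.1%}")
```

Under these assumptions the two quoted numbers are consistent with an accounting window of roughly 102 days, and they suggest that on average under a fifth of the region's CPUs were busy with accounted jobs, which fits the slide's emphasis on avoiding low-usage periods.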