Presentation - ACK Cyfronet AGH

Enabling Grids for E-sciencE
Operating
Central European EGEE ROC
Marcin Radecki, Tomasz Szepieniec, Aleksander Kusznir
and Marian Bubak
ACC CYFRONET AGH
www.eu-egee.org
CGW’06
EGEE-II INFSO-RI-031688
17 October 2006
EGEE and gLite are registered trademarks
Outline
• Introduction
– EGEE and Central European (CE) Region
• Challenges for CE Regional Operating Centre
– Applications & Users
– Cooperation
– Grid Infrastructure
• Conclusions
CGW’06; Cracow; 15-18th October 2006
EGEE – Community
• Possibly the largest production infrastructure; spans over 32 countries
• ca. 200 sites grouped under 11 ROCs
• Scientific community involves over 2000 people
• EGEE’06 conference in Geneva
– 700 attendees
– 32 “partner” projects present
ID        Name       Discipline               Users
EGEE-001  Atlas      Physics                    890
EGEE-002  Alice      Physics                    175
EGEE-003  LHCb       Physics                    159
EGEE-004  CMS        Physics                    632
EGEE-010  ESR        Earth Sciences              42
EGEE-014  Biomed     Biomed                     114
EGEE-039  Comp Chem  Chemistry                   15
EGEE-040  Magic      Astro particle physics      16
EGEE-042  dteam      Infrastructure testing      30
EGEE-065  EGEODE     Geo-Physics                 33
EGEE-066  Planck     Astrophysics                 8
Total                                          2114
VOs
Central European Region in EGEE
• 7 countries, 22 sites, 1493 CPUs, 70 TB of storage space
• Supports 10 of the 11 EGEE-approved VOs plus many associated VOs
• Site size ranges from 2-3 to 300 CPUs
• Need for solutions suitable for both large computing centres and small sites
– Maintenance model
– Skills & experience
– Scalable across a site’s resources
Challenges for CE ROC
• We need to attract new users to the grid and enable them to work
in the new environment so that the resources are used efficiently.
We must provide the services the users require.
• The grid spans many administrative domains, each of which
needs to cooperate actively in order to share resources and
collaborate productively. This is an excellent opportunity for
sharing expertise.
• Having resources is not enough: the infrastructure needs to be stable
before real users start to use it, and we should maximize utilization
as far as possible.
Grid-enabling users
• Means of attracting and retaining users
– Understand users’ needs and satisfy them
– Easy access, how-to-use documentation (in national languages)
– Stable working environment
– User Support infrastructure
• Results:
– Computational chemistry
▪ Mariusz Sterzel (CYFRONET) coordinates computational chemistry applications in EGEE
▪ Enabling commercial software - Gaussian VO
▪ Study on pyrazoloquinolines (PQ) used for laser light generation
– Bioinformatics
▪ Never Born Protein folding and function recognition - Prof. Irena Roterman’s team (CM-UJ)
– Others:
▪ Many small teams are working within the regional catch-all VO, VOCE
VOs in the Region
• Supported VOs:
alice, atlas, auger, balticgrid, belle, biomed, cms, compass, compchem,
crogrid, esr, euchina, gamess, gaussian, geant4, gear, geclipse,
hone, hungrid, lhcb, magic, ops, skgrid, voce, vocet, zeus
• Service/Data Challenges and test productions
– Atlas Service Challenge 4
– World-wide In Silico Docking On Malaria data challenge, 1st and 2nd (ongoing)
– EGEE-ITU
▪ International digital broadcasting agreement – new frequency plan
▪ Compatibility and complementary analysis
Management of CE ROC
• ROC Manager
– Represents the region in the Project’s managerial bodies
– Supervises all Service Activities
• Operations
– Coordinates actions related to infrastructure and middleware
– Escalates problems that cannot be solved to the next level up
– Fits the Project requirements to the region
• User Support
– Provides support tools for users
– Takes part in shifts handling all user tickets in the GGUS system
• Security
– Incident handling procedures
– Incident response team
[Figure: CE ROC organisation chart. Roles: ROC Manager; User Support Responsible, Operations Responsible, Security Responsible. Tasks: 1st Line Support, Core Grid Services, Regional Certification of Middleware, Grid Operator On Duty, Pre-Production Service. Institutions shown: CYFRONET, IISAS/PSNC, CESNET/PSNC, ICM WARSAW.]
Procedures and Commitments
• Well-defined procedures make collaboration more efficient
– Clear paths for how we deal with things, to avoid misunderstandings
– There are always newcomers
– People tend to forget things over time
• Examples of procedures:
– New site registration
– New site admin joining
– Site problem handling
– Sending Weekly Reports
• Monitoring of commitments makes people more motivated
Operations - coordinate the work
• Operations is the most time-consuming task
– To make sure that operational procedures are understood and followed properly
– To ensure production requirements are met at the sites
– To work out the best solutions for problems
– To understand expectations and needs
– To make sure problems are solved in a proper way
– To ensure weekly reports are completed and sent
• Three styles of site administration observed
– Keeps all services ready all the time: “I’m the best admin in the city”
– Reacts only on receiving a problem report: “I’m a bit occupied”
– Reacts only if their name appears on a publicly available “black list”: “I’m hard at work on… something important”
Resources and their usage
• Accounting in EGEE
– July-October ’06: over 672k CPU hours computed in the CE region, equivalent to 275 CPUs running 24x7
– Problems with “missing” data
– Update rate: daily
• Our approach to accounting
– Site performance efficiency study:
- Up-to-date information on what is going on at a site
- Maximize site utilization
→ better to have jobs queued at a site than idle CPUs
– Is being extended towards a new system for fine-grained accounting
[Chart: Max. CPUs, Jobs Executing and Jobs Queued over time; annotation: Avoid low usage periods]
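The accounting slide equates 672k CPU hours with 275 CPUs running 24x7. A minimal sketch of that arithmetic follows. The slide does not give exact period boundaries, so the dates 1 July to 11 October 2006 are an assumption chosen for illustration; with them the result comes to roughly 275 CPUs. The function names and the site snapshot (300 CPUs, 240 running jobs) are hypothetical and not part of the EGEE accounting system.

```python
from datetime import date


def equivalent_cpus(cpu_hours: float, start: date, end: date) -> float:
    """Average number of CPUs that must run 24x7 from start (inclusive)
    to end (exclusive) to deliver the given number of CPU hours."""
    wall_hours = (end - start).days * 24
    if wall_hours <= 0:
        raise ValueError("end must be later than start")
    return cpu_hours / wall_hours


def idle_cpus(max_cpus: int, jobs_executing: int) -> int:
    """CPUs without a job at one sampling instant, assuming one job per CPU."""
    return max(max_cpus - jobs_executing, 0)


# Assumed period boundaries; the slide only says "July-October '06".
PERIOD_START = date(2006, 7, 1)
PERIOD_END = date(2006, 10, 11)

if __name__ == "__main__":
    cpus = equivalent_cpus(672_000, PERIOD_START, PERIOD_END)
    print(f"equivalent CPUs running 24x7: {cpus:.1f}")
    # Hypothetical snapshot of a 300-CPU site with 240 running jobs.
    print(f"idle CPUs: {idle_cpus(300, 240)}")
```

Tracking idle CPUs alongside the queue length is one simple way to spot the low-usage periods the slide wants to avoid: idle CPUs with an empty queue point to missing demand, idle CPUs with a full queue point to a site problem.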
Stable infrastructure - social aspect
• How EGEE keeps the Grid stable
– Grid Operator on Duty (GOD) watches the entire grid
▪ CE joined this activity in the first round of EGEE-II
– Raises a ticket for each detected problem
– Diagnoses problems and suggests solutions
– Uses monitoring tools for problem detection and availability metrics
• 1st Line Support in CE - how to be better than the average?
– Detect and fix failures before the GOD team notices them and raises a ticket
– Support site admins with remedial actions
– Suggest practices known to work well → expertise sharing
– Knowledge is painful to extract from people’s heads → although sharing it saves a lot of working time, people need a lot of encouragement to do so
Stable infrastructure - monitoring with NAGIOS
• Monitors CE Core Services
– Added tests for checking RB, BDII, LFC, VOMS
• Try to monitor as much functionality as possible
– E.g. the certificate expiration date of every machine
– Smart testing hierarchy
• Reasonable probe frequency
• Send a problem notification immediately but…
– Do not spam every 5 minutes
• Allow the site admin to say that the problem is being worked on
– Do not send further notifications until notified by the admin
• Allow the site admin to schedule an extraordinary check at will
– Lets the admin verify at once how well a workaround is working
• Used by 1st line support
– Overview of the region
– Detailed check of services
– Schedule checks when working on fixes
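The notification rules on the monitoring slide (notify at once, do not repeat every 5 minutes, stay quiet while an admin is working on the problem) can be sketched as a small state machine. This is plain Python for illustration, not Nagios configuration or a Nagios API; `ServiceState`, `should_notify` and the 4-hour reminder interval are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed reminder interval; the slides only say "do not spam every 5 minutes".
RENOTIFY_INTERVAL = 4 * 3600  # seconds


@dataclass
class ServiceState:
    failing: bool = False                  # did the previous check fail?
    acknowledged: bool = False             # admin said "being worked on"
    last_notified: Optional[float] = None  # time of the last notification


def should_notify(state: ServiceState, check_failed: bool, now: float) -> bool:
    """Update state with one check result and decide whether to notify."""
    if not check_failed:
        # Service is healthy: clear the state so a future failure
        # triggers a fresh, immediate notification.
        state.failing = False
        state.acknowledged = False
        state.last_notified = None
        return False
    if not state.failing:
        # First failed check: notify immediately.
        state.failing = True
        state.last_notified = now
        return True
    if state.acknowledged:
        # Admin is working on it: stay quiet.
        return False
    if state.last_notified is None or now - state.last_notified >= RENOTIFY_INTERVAL:
        # Still failing and nobody reacted: send a reminder.
        state.last_notified = now
        return True
    return False
```

A probe running every 5 minutes would call `should_notify` after each check; the admin's acknowledgement sets `state.acknowledged = True`, and the next successful check clears it again.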
Operations metrics results
EGEE Operations metrics results from the last 10 months (Dec 05 - Sep 06)
[Chart: Functional test failure % ratio; y-axis: % of failures; series: EGEE, CE, Best player]
[Chart: Time unavailable % ratio; y-axis: % of time; series: EGEE, CE, Best player]
Data from EGEE CIC portal:
https://egee.in2p3.fr/CIC/index.php?id=cic&subid=cic_roc_metrics&scope=project&project=&metrics=sft
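Both metrics on the slide above are percentage ratios. The sketch below shows one plausible way to compute them; the exact definitions used by the CIC portal are not given in the slides, so the formulas are assumptions and the sample numbers are made up for illustration.

```python
def percent_ratio(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


def monthly_metrics(tests_run: int, tests_failed: int,
                    hours_unavailable: float, hours_in_month: float) -> dict:
    """Assumed definitions: failed functional tests over all tests run,
    and unavailable hours over all hours in the month."""
    return {
        "functional_test_failure_pct": percent_ratio(tests_failed, tests_run),
        "time_unavailable_pct": percent_ratio(hours_unavailable, hours_in_month),
    }


if __name__ == "__main__":
    # Hypothetical month: 200 tests with 6 failures, 36 of 720 hours down.
    print(monthly_metrics(200, 6, 36, 720))
```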
Conclusions
• CYFRONET gained know-how in:
– Coordination of a large initiative
– Organization of work for different subtasks
– Running a stable production infrastructure
– Accurate Grid job accounting
– Sensible and precise Grid infrastructure monitoring
– Introducing application users to the Grid
• Experience gathered in the CE ROC can readily be reused in building the national Polish grid
PL-Grid: the Polish national grid infrastructure
Team of the Academic Computer Centre CYFRONET AGH
Cracow, June to September 2006
This document presents the motivation, goals, concept and approach for
creating a national grid infrastructure, which is essential for the modern
conduct of scientific research (e-Science) and consistent with the
European infrastructure.
PL-Grid as an infrastructure for e-Science
Scientific research today requires the use of advanced information
technologies. A growing number of research teams collaborate intensively,
and this requires IT tools for gathering and exchanging the acquired
knowledge on a global scale. Experimental results are huge, distributed
data sets of varied structure, and working with them requires tools for
data access, integration and processing. Computer simulation is a fully
accepted research method, and results obtained from simulations and from
experiments are combined more and more often. This innovative approach is
most visible in high-energy physics, astrophysics, the biological and
medical sciences, and the Earth sciences.
Realizing this new paradigm of scientific research, called e-Science,
requires a grid infrastructure (also called Cyber-Science Infrastructure).
It comprises software that enables the sharing of various computing
resources and tools that support cooperation between partners within
so-called virtual organizations.
Fig. 1. PL-Grid as an infrastructure for e-Science
PL-Grid, Warsaw, 22.09.2006
Simplified PL-Grid architecture
[Diagram: layers, top to bottom]
• Users
• Access / application development layer
– Grid portals, programming tools
• Grid services
– Job management
– Monitoring
– Data management
– Virtual organization management
– Security system
• Basic grid services
– LCG/gLite (EGEE)
– UNICORE (DEISA)
– Globus
• Grid resources
– Distributed data repositories
– National computer network
– Distributed computing resources
PL-Grid organizational structure
[Diagram: Consortium Board (Coordinator + members); Users’ Council; Consortium Council; Domain grids; Operations Centre; PL-Grid; Infrastructure (hardware, network). Arrow labels: Information, Proposals, Reports, Recommendations, Coordination, Evaluation.]
Work schedule
[Gantt chart: timeline in months, 0 to 36 in 3-month steps]
• Tasks
– Project preparation and approval
– Organization of the consortium
– Staff recruitment
– Equipment purchases
– Research and training infrastructure
– Production infrastructure
– Software development
– Grid training
– Activity reviews
• Phases
– Test phase
– Pilot phase
– Maintenance and development phase