2008_02_05-06_CIC_Portal_and_COD_Activities

Thursday 13 July 2017
CIC Portal/COD Activities
Hélène Cordier
IN2P3/CNRS Computing Centre, Lyon, France
Contents
- CIC Portal Usage: who/how
- Latest Release Portal Characteristics
- On-going developments
- CIC portal overview for COD
- Statistics and results
- Working groups
- Zoom on Failover

Use tools

Each actor can use a set of operational tools (provided, integrated or interfaced).
Actors: USER, VO MANAGER, SITE, REGIONAL CENTER, OPERATOR
Tools (CIC Portal) let them:
- communicate
- manage static information about my VO
- report on site activity, submit tests, configure
- track, report, diagnose and follow up problems
The 8th IEEE/ACM International Conference on Grid Computing (Grid 2007)
What do people connect to the CIC portal for?
[Figure: average number of connections per month, Dec 2004 to Dec 2007; y-axis: number of connections (0 to 1000); x-axis: month]
Distribution in 2005: OAG 4%, home 17%, users 4%, COD 39%, VO 11%, RC 11%, ROC 14%
Distribution in 2007: OAG 0%, home 28%, COD 37%, users 1%, VO 6%, ROC 5%, RC 23%
Connections and process
[Figure: number of sent broadcasts per month, Aug 2006 to Jan 2008]
[Figure: total number of registered VOs over time; data label: 133]
[Figure: new VO registrations per month]
Tasks handled by the CIC portal development team

Between October 2006 and February 2007

Task breakdown by type:
  Internal tools & synchronization: 18%
  High level or political action: 9%
  Incidents and bug fixing: 25%
  Technical investigation: 5%
  Tests and verifications: 7%
  Development of new features: 6%
  Improvement of existing features: 30%

Task breakdown by origin of the request:
  Failover: 8%
  OCC: 7%
  Others: 2%
  internal: 28%
  OAG + VOs: 13%
  ROCs: 17%
  COD: 25%

Between February 2007 and January 2008

Task breakdown by type:
  Internal tools & synchronization: 18%
  Technical investigation: 6%
  High level or political action: 5%
  Incidents and bug fixing: 20%
  Tests and verifications: 12%
  Improvement of existing features: 20%
  Development of new features: 30%

Task breakdown by origin of the request:
  Others: 15%
  internal: 17%
  Failover: 4%
  OCC: 12%
  OAG + VOs: 17%
  COD: 25%
  ROCs: 10%

Latest changes in 6 months

Last technical changes
– authentication is now based on full certificate DN instead of CN
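
To illustrate why matching on the full certificate DN is stricter than matching on the CN alone, here is a minimal sketch. It is not the portal's code: the DNs, the registry and the function names are hypothetical, and DNs are assumed to be in the OpenSSL-style "/key=value/..." form.

```python
def parse_dn(dn):
    """Split an OpenSSL-style '/key=value/...' DN into (key, value) pairs.

    Simplification for this sketch: values containing '/' are not handled.
    """
    return [tuple(part.split("=", 1)) for part in dn.split("/") if "=" in part]


def common_name(dn):
    """Return the last CN component of a DN, or None if there is none."""
    names = [value for key, value in parse_dn(dn) if key.upper() == "CN"]
    return names[-1] if names else None


# Hypothetical registry: full DN -> role.
REGISTERED = {"/DC=org/DC=example/OU=SiteA/CN=Jane Doe": "site-admin"}


def role_by_dn(dn):
    """Match on the full DN: only the registered certificate is accepted."""
    return REGISTERED.get(dn)


def role_by_cn(dn):
    """Match on the CN only: any certificate with the same CN is accepted."""
    cn = common_name(dn)
    for known_dn, role in REGISTERED.items():
        if common_name(known_dn) == cn:
            return role
    return None


# A different person (or CA branch) with the same common name.
other = "/DC=org/DC=example/OU=SiteB/CN=Jane Doe"
```

With CN matching, `role_by_cn(other)` grants the role registered for the SiteA certificate; with DN matching, `role_by_dn(other)` returns `None`.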

Work on VO ID cards
– changes in the database schema for VO/VOMS information
– VO ID card interface improved
– integration of the YAIM VO Configurator into the CIC portal
– downloadable XML dump of VO ID card info

Scheduled downtimes procedure

Integration of the regional first-line support dashboard – prototype with CE
On-going developments

What is left for next release in March

- 2159: Adapt to new components released into production, cf. YAIM tool.
- 1559: Development of a new version of the report, taking several pieces of feedback into account.
- 1920: Follow the SAM migration to GridView on the CIC portal side -> IDLE
- Internal tasks include quick fixes/bug fixes, documentation, background clean-up work, and code optimization/prospective work for EGEE-III.
COD activity

ARM Meeting, EGEE’07, Budapest
A tool for Grid Operators: COD dashboard
Start of EGEE: MANY ENTRY POINTS. The operator works separately with sites info, monitoring tools #1 to #n, a mail client and the ticketing system.
Now: SINGLE ENTRY POINT. The operator works with the dashboard, which brings together sites info, monitoring tools #1 to #n, a mail sender and the ticketing system.
Interaction with EGEE services
OPERATIONS PORTAL (IN2P3-CC, Lyon, France)
- Per-site view of statuses and tickets, e.g. Site1: ticket #28, Site2: ticket #32, Site3: no ticket, Site4: ticket #14
- Ticket actions: create ticket, update ticket, view ticket
Services the portal interacts with (transports shown in the diagram: SOAP, http)
- GGUS (FZK, Karlsruhe, Germany): tickets, via SOAP
- GOC-DB: site info, scheduled downtimes
- SAM (CERN, Geneva, Switzerland): test results on nodes
- Gstat (ASGC, Taipei, Taiwan): GIIS status per site
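
The single-entry-point idea can be sketched as a small aggregation step that merges what each service reports into one row per site. This is only an illustration: the `SiteRow` structure, the `build_dashboard` and `needs_ticket` functions and the sample values are hypothetical, and the real portal fetches this data from GGUS, GOC-DB, SAM and Gstat rather than from in-memory dictionaries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteRow:
    site: str
    sam_ok: Optional[bool]    # SAM test results (None if unknown)
    gstat_ok: Optional[bool]  # Gstat / GIIS status (None if unknown)
    in_downtime: bool         # scheduled downtime declared in GOC-DB
    ticket: Optional[int]     # open GGUS ticket number, if any


def build_dashboard(sites, sam, gstat, downtimes, tickets):
    """Merge the per-service views into one row per site."""
    return [
        SiteRow(
            site=site,
            sam_ok=sam.get(site),
            gstat_ok=gstat.get(site),
            in_downtime=site in downtimes,
            ticket=tickets.get(site),
        )
        for site in sites
    ]


def needs_ticket(row):
    """A failing site with no downtime and no open ticket needs attention."""
    failing = row.sam_ok is False or row.gstat_ok is False
    return failing and not row.in_downtime and row.ticket is None


# Hypothetical sample data, echoing the ticket numbers of the diagram.
sites = ["Site1", "Site2", "Site3", "Site4"]
sam = {"Site1": False, "Site2": False, "Site3": False, "Site4": True}
gstat = {"Site1": True, "Site2": False, "Site3": True, "Site4": False}
downtimes = set()
tickets = {"Site1": 28, "Site2": 32, "Site4": 14}

rows = build_dashboard(sites, sam, gstat, downtimes, tickets)
to_open = [row.site for row in rows if needs_ticket(row)]
```

Here the operator sees in one place that Site3 is failing without an open ticket, instead of cross-checking several tools by hand.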
Statistics
[Figure: proportion of COD tickets against GGUS tickets for all ROCs, 31 Jul to 31 Dec; y-axis 0 to 800; series: tickets opened by COD teams, tickets opened through GGUS, all GGUS tickets]

% of opened tickets   CE   SE   SRM   RGMA   sBDII
October               39   15   14    11     6
November              34   14   18     6    10
December              29   18   21     9     8

Solution time [hours]           Oct   Nov   Dec
COD tickets                     269   268   228
GGUS tickets assigned to ROCs   277   281   307
All SUs                         364   427   709
COD Duties

- Rotation among 10 federations/teams: each is on duty 1 week out of 5.
- Quarterly face-to-face meetings to update tools and procedures and to harmonize working habits.
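
As a rough illustration of the "1 week out of 5" rotation, the sketch below builds a round-robin schedule. It assumes two teams on duty each week, which is one way 10 teams end up with one duty week in five; the slide does not state this, and the team names and the `duty_schedule` function are hypothetical.

```python
def duty_schedule(teams, weeks, on_duty=2):
    """Return [(week_number, [teams on duty])] in round-robin order.

    Assumption: `on_duty` teams share each week, so with 10 teams and
    on_duty=2 every team is on duty one week out of five.
    """
    n = len(teams)
    schedule = []
    for week in range(weeks):
        start = (week * on_duty) % n
        crew = [teams[(start + i) % n] for i in range(on_duty)]
        schedule.append((week + 1, crew))
    return schedule


# Hypothetical team names T01..T10.
teams = [f"T{i:02d}" for i in range(1, 11)]
schedule = duty_schedule(teams, weeks=10)
```

Over the ten weeks computed here each team appears exactly twice, and the crew of week 6 is the same as the crew of week 1.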
===================================
- 10 federations over 18 months in EGEE-I
- Working groups for over 18 months now...
There is more to it...
Straightforward mandate working groups:
- GSTAT -- TW
- SAM -- CERN
- SAMAP -- CE, topped by Tools for Improvement for COD, TIC -- CE (EGEE'07)
Working groups mandate
- Integration of the existing tools, CIC -- FR
  Integration platform for all COD tools to ease the daily operational job
- Improvement of BEST PRACTICES -- DE-CH
  Identify, raise and analyse with COD how to achieve homogeneous operations
- Release of updated documentation, OPM -- SE
  Documentation under constant evolution
- Set-up of failover mechanisms for GRID CORE SERVICES -- SWE
  What is done at the federation level, what is done at the project level (need help from JShiers group), what could be done (operational point of view) and what is needed at the ROC/site level (from a m/w point of view)
- Set-up of a High Availability strategy for the operational tools for CODs, FAILOVER -- IT
Failover working group
EGEE Failover: purpose

Propose, implement and document failover procedures for the
collaboration, management and monitoring tools used in the
EGEE/WLCG Grid.
– The solution is based on DNS and consists of:
• mapping the service name to one or more destinations
• updating this mapping whenever a failure is detected
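
The two steps above can be sketched as follows. This is a toy model under stated assumptions, not the working group's implementation: `DnsTable` is an in-memory stand-in for a DNS zone, and the host names, the `update_mapping` function and the health check are hypothetical.

```python
class DnsTable:
    """In-memory stand-in for a DNS zone: service name -> destinations."""

    def __init__(self):
        self._records = {}

    def set(self, name, destinations):
        self._records[name] = list(destinations)

    def resolve(self, name):
        return list(self._records.get(name, []))


def update_mapping(dns, name, candidates, is_healthy):
    """Remap `name` to the healthy candidates, keeping preference order.

    If no candidate is healthy, the previous mapping is left untouched.
    """
    healthy = [host for host in candidates if is_healthy(host)]
    if healthy:
        dns.set(name, healthy)
    return dns.resolve(name)


# Hypothetical names, for illustration only.
SERVICE = "portal.example.org"
CANDIDATES = ["primary.example.org", "replica.example.org"]

dns = DnsTable()
dns.set(SERVICE, CANDIDATES[:1])  # normal operation: primary only

# Simulated health check results: the primary has just failed.
status = {"primary.example.org": False, "replica.example.org": True}
after_failure = update_mapping(dns, SERVICE, CANDIDATES, lambda h: status[h])
```

After the simulated failure the service name resolves to the replica only. In a real deployment the TTL of the DNS record bounds how quickly clients notice such a change.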

Geographical failover for the EGEE-WLCG Grid collaboration tools
– CHEP 2007, Victoria BC, Canada (September 2007)
COD work aspects to keep in EGEE-III

Dedication: working groups are recognized within federations as a source of
expertise, and by federations as the way to bring needs to the central
operations.

Collaboration: up to now, each federation has found a way to contribute
actively to improving its COD work environment, when not proactively leading
a working group.
Also, each person, tool developer or expert recognized as being of "global
interest", even though outside the COD scope, has been happily integrated into
this "closed community", e.g. SAMAP -> TIC scope, to monitor this aspect with
the Nagios prototype for example.

Flexibility: the purpose of the groups is to evolve together with their
mandate, over time and as needs come up, e.g. core grid services HA, EGI.

Anticipation: e.g. strategy of the Operational Failover Working Group.

Experiment: e.g. regionalisation of tools and the future modular "NGI
dashboards" to widen the CE first-line support experience.
COD work aspects to evolve in EGEE-III

Mandate and assessment of the COD activity
- Integration of NDGF/NE as a COD team; other teams?
- Catch-all and global operations center
-- which core services are to be monitored centrally, how to monitor them, and
how to switch properly to backup
-- how to aggregate local data, and which local data would be concerned
- Assess metrics in order to identify the most problematic m/w components and
recurrently unreliable sites
- Operational tools reliability assessment: ENOC test as a starting base?
- Strengthen the need for HA/failover of operational tools and grid core services

Vision of the long-term evolution of the COD tools: 1 set of tools per
federation + aggregation?
Which set of tools is to be regionalized? SAM, GOC DB, COD? What else?
How are they going to interact? => A global schema is needed NOW.
COD work aspects to evolve in EGEE-III

Leverage on "project-labeled" tools so that operational use cases do not
remain "pending":
- development strategy/priorities are coherent
-- data workflow: synchronization of GOCDB/BDII/SAM/COD
-- development strategy: depends on the strategy for the long-term evolution
of the COD tools
-- priority decision workflow: who drives the priority of requests to
"project-labeled" tools, and how, so that operational use cases do not remain
"pending":
- critical tests monitoring/accounting or ARC CE,
- CA update procedure,
- need for SAM failover...
- staffing is adequate for proper reactivity, not only for bug fixes

Interoperability/interoperations (item to be followed up)
– OSG: rather informal for the moment, BUT NOW users do have problems and
sites relay their users' problems, cf. GGUS ticket 31037.
– NDGF: is there existing critical test monitoring? What are the consequences
for operational procedures?
Conclusions and References
Where, how and when do we address these topics?
Some can be addressed here or thought about at COD meetings;
some are relevant to OCC/ROC first, and COD working groups can
then make suggestions/recommendations.
References:
- CIC portal: a Collaborative and Scalable Integration Platform for High
Availability Grid Operations. Grid 2007 (IEEE), Austin, TX, United States (September 2007)
- Geographical failover for the EGEE-WLCG Grid collaboration tools.
CHEP 2007, Victoria, BC, Canada (September 2007)