LHCOPN: Operations status
Guillaume.Cessieux @ cc.in2p3.fr
Network team, FR-CCIN2P3
LHCOPN meeting, Barcelona, 2010-06-29
Outline

Operations status
– TTS stats
– Change management
– Backup tests

Ongoing
– Relationships with WLCG
– Around GGUS
What was reported in the TTS?

395 tickets in the TTS since 2009-02
– 381 solved (96%)
– 7 in progress
• Normal ongoing issues or scheduled work
– 5 unsolved
• Mainly performance issues not understood
• Duplicate or erroneous tickets
• Cancelled or postponed work
– 2 assigned
• Twiki review pending (CA-TRIUMF, NDGF)
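
As a quick consistency check of the figures above, the sketch below recomputes the total and the solved share from the per-status counts. The dictionary keys are illustrative labels chosen here, not official GGUS status names.

```python
# Ticket counts per status, as reported on this slide.
# The key names are illustrative labels, not official GGUS status names.
tickets_by_status = {
    "solved": 381,
    "in progress": 7,
    "unsolved": 5,
    "assigned": 2,
}


def status_shares(counts):
    """Return the total and each status' share of it as a rounded percentage."""
    total = sum(counts.values())
    shares = {status: round(100 * n / total) for status, n in counts.items()}
    return total, shares


total, shares = status_shares(tickets_by_status)
print(total, shares["solved"])  # 395 96
```

The four categories add up to the 395 tickets quoted, and 381 of 395 rounds to the 96% shown.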
5 long-standing issues

1 infrastructure
– #55697: 2010-03-10, FR-CCIN2P3, BGP flapping with CH-CERN
• Ongoing issue, root cause not yet found, ~1 flap/day, not service affecting

4 administrative
– #48335: 2009-04-30, Additional prefix for CA-TRIUMF
• Missing notification of acceptance from NDGF, UK-T1-RAL and US-FNALCMS
– #52959: 2009-11-04, UK-T1-RAL, Review of LHCOPN twiki
• Only the routing policies remain to be updated
– #56415: 2010-03-12, NDGF, Review of LHCOPN twiki
• Not started
– #56417: 2010-03-12, CA-TRIUMF, Review of LHCOPN twiki
• Not started

Ops phone conferences do not seem very successful at getting these solved
Overall breakdown per category and type of problem
80% of tickets are L2-related events
Number of tickets put in the TTS per month
AVG: 23 tickets/month
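
The average can be cross-checked against the 395 tickets reported since 2009-02. The sketch below assumes the counting period runs through June 2010 inclusive (the month of this meeting); the slides do not state the end of the period, so that bound is an assumption.

```python
from datetime import date


def months_inclusive(first, last):
    """Number of calendar months from `first` to `last`, both included."""
    return (last.year - first.year) * 12 + (last.month - first.month) + 1


total_tickets = 395  # total reported in the TTS since 2009-02
# Assumption: the period ends in June 2010, the month of this meeting.
n_months = months_inclusive(date(2009, 2, 1), date(2010, 6, 1))
average = total_tickets / n_months
print(n_months, round(average))  # 17 23
```

Under that assumption the result agrees with the 23 tickets/month quoted on the slide.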
Ticket ownership per site
Nearly 1/4th of tickets
NL-T1 has 6 LHCOPN links
Ownership of tickets per month per site
Kind of tickets per month
KPI-1: Infrastructure vs operations behavior
Fewer than 15 “significant” events / month?
Change management

Only 5 tickets flagged as « change »!
– Is the infrastructure that stable?

The flag is set in the GGUS submit interface
Conclusion on TTS stats
L2 events are regular and therefore well managed
NL-T1 seems to have a very good implementation of the Ops model
Administrative tasks are frozen
– Twiki review, change management etc.
• Not fascinating, but the bare minimum required

Decrease in the monthly number of tickets
– Feeling from sites that not all tickets are useful
– Need to ensure the bare minimum is in place by correlating with monitoring
Backup tests?

Previously agreed: each resilience possibility should be demonstrated at least once a year
– Failures can count as a test if they are properly reported (particularly path symmetry)

Only two sites have reported a backup test or a demonstration of backup efficiency for 2010
• https://twiki.cern.ch/twiki/bin/view/LHCOPN/LhcopnBackupTestsResults2010
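
The yearly rule above lends itself to a simple automated check. The sketch below is a simplification that tracks one test per site rather than one per resilience possibility; the site names and report dates are hypothetical placeholders, not the content of the twiki page.

```python
from datetime import date


def sites_missing_backup_test(sites, reports, year):
    """Return the sites with no reported backup test (or qualifying failure) in `year`."""
    tested = {site for site, day in reports if day.year == year}
    return sorted(set(sites) - tested)


# Hypothetical data, for illustration only.
sites = ["SITE-A", "SITE-B", "SITE-C"]
reports = [
    ("SITE-A", date(2010, 3, 2)),    # test reported in 2010
    ("SITE-C", date(2009, 11, 20)),  # last report dates from the previous year
]
missing = sites_missing_backup_test(sites, reports, 2010)
print(missing)  # ['SITE-B', 'SITE-C']
```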

No recent change in the infrastructure, so no need to test?
Following up MDM deployment-related issues

Only deployment issues?
– Physical set up etc.
– Interaction with sites

Should be tracked through tickets
– Still in GN3 helpdesk system?
• GN3 people have no access to GGUS
• LHCOPN people have no access to GN3 helpdesk

Should be visible
– How, where?
What’s missing to go ahead?

Network SLD
– What is a « significant » event requiring care, etc.?

Monitoring
– Do we have service-impacting events?
– Correlation with Operations
– Evidence instead of feelings
• Particularly for performance issues

Fill the gap between WLCG Ops and LHCOPN Ops
– Gap by design, but a bridge is expected
Relationships with WLCG (1/4)

A lot of work was previously done by Wayne
– Clear overview during the Vancouver presentation
• http://indico.cern.ch/materialDisplay.py?contribId=17&materialId=slides&confId=59842
• Agreed by WLCG!
• Only careful implementation missing?

Minimum relationships should consist of
– Exchanges during meetings
– Operational exchange
• Clear process and KPIs around it
– Facilitated by ticket linking
• Dashboard of service-affecting issues
– Sharing LHCOPN monitoring information
Relationships with WLCG (2/4)

Main stoppers
– Meetings
• Not acting and represented as a whole community through an LHCOPN representative or “liaison officer”
• Too often asked to be there « just in case »
– Operational exchanges
• Complex and hard to get used to, with very few issues involving WLCG (~1 every 3 months?)
– Post-mortem analysis is hard as a lot of exchanges seem to be off the record
– Now a highly resilient network
– A lot of things are sites’ internal processes
• Common use of GGUS gives a false feeling of relationships
– We are not doing user support!
• Mistake to assume we can handle all network issues from our isolated island with our closed set of supporters
– Need coordination and action from other teams (storage…)
– Problem interacting with WLCG supporters
Relationships with WLCG (3/4)

Sample expected workflow for WLCG inquiries
[Flow diagram; elements recoverable from the figure]
– Actors: Experiment, Site Contact, Site Network Team, Relevant Network Team, Relevant WLCG Team
– Decisions: Networking? (Yes/No), LHCOPN Related? (Yes/No)
– Ticket systems: WLCG GGUS, LHCOPN GGUS, Internal Ticket System
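
One possible reading of the workflow diagram can be sketched as a small routing function. The mapping of each branch to a team and a ticket system is an interpretation of the diagram labels, not a confirmed specification, and the function and parameter names are hypothetical.

```python
def route_inquiry(is_networking, is_lhcopn_related=False):
    """Return (handling team, ticket system) for an inquiry reaching the site contact.

    Assumed reading of the diagram: the experiment files a WLCG GGUS ticket,
    the site contact triages it, and only LHCOPN-related network issues
    end up in the LHCOPN GGUS helpdesk.
    """
    if not is_networking:
        # Not a network matter: stays within WLCG support.
        return ("Relevant WLCG Team", "WLCG GGUS")
    if is_lhcopn_related:
        # Network issue on the LHCOPN: the site network team uses the LHCOPN helpdesk.
        return ("Site Network Team", "LHCOPN GGUS")
    # Network issue outside the LHCOPN: handled through an internal ticket system.
    return ("Relevant Network Team", "Internal Ticket System")


print(route_inquiry(True, True))
```

Keeping the triage at the site contact matches the point made elsewhere in these slides that the LHCOPN helpdesk coordinates network teams rather than supporting users directly.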
Relationships with WLCG (4/4)

Workplan
– A dashboard showing tickets impacting WLCG
• Done: Particular view on the dashboard
– Ability to link WLCG and LHCOPN tickets
• Upcoming: Parent/Child relationship
– Cross reference still here (no associated workflow)
• But problem interacting with WLCG supporters
– No cross-helpdesk access to update tickets
– On-site processes
• Push for careful implementation of « Site’s contact »?
– Sites’ internal processes
Around GGUS (1/6): GGUS status list
Around GGUS (2/6): LHCOPN submit interface
Around GGUS (3/6): WLCG submit interface
Around GGUS (4/6): Merging Pros

Should we unify/merge LHCOPN helpdesk within the standard
GGUS?
+ Consider networks like other resources (computing, storage, software...)
• Networks are not standalone resources; coordination between sites is required
+ Maybe a better fit for reporting
• True
+ Now the standard way to send enquiries to sites?
• Yes for Grid issues, not always for network teams, which are less Grid-centred and unwilling to work at project level
• But for a project’s dedicated network?
+ Maybe some central manpower could be gained
+ Regularly chasing pending tickets...
• Very unclear who can do that, and whether it would be successful (cf. twiki review)
+ Less specific software and support from GGUS
• No real saving for them: still using the same database, hosts etc. and sharing some code
+ Ease interactions with WLCG supporters
• Issues evolving in two different worlds
• Write access to our helpdesk restricted to network teams
Around GGUS (5/6): Merging Cons
– We have something stable and working
• Definitely, but that should not prevent improvements
– Completely tailored for us and closely matching our operational model
• Seems hard to merge frontends and unify workflows
– Stay away from interference with the Grid world
• Isolation could be achieved with particular views?
• Was a key concern from network teams
– Not shaped to do user support
• But coordinating network teams
• Maintenance not in GGUS
• No strong preference from the GGUS team
• Confirmed, not a problem for them
Around GGUS (6/6): Conclusion about merging

Our helpdesk was designed to coordinate network teams, not to support WLCG users
– Really different from standard GGUS
– Appears as an internal coordination tool

Benefits not so clear; merging was mainly thought to ease integration into WLCG Ops
– But we are not doing Grid Ops
• Network issue ≠ Grid issue
• Networks are not standalone resources (unlike storage, CPU etc.)
– Similar to software issues handled externally (in savannah)
• We should not be customer-facing
– Selected inquiries going through storage teams (“Site contact”)

Let’s also see how EGI will converge around user support
Conclusion about LHCOPN Operations

Ops status: clear room for improvement
– Sites follow processes unevenly, for lack of a clear sense of usefulness and of evidence of network failures
– L2 events well handled while administrative workflow is forgotten

WLCG relationships to be implemented and nurtured
– Performance issues need smart and timely solving
– Skeleton of coordination with WLCG Ops to be improved

No outstanding benefit in unifying the LHCOPN helpdesk with the WLCG one
– Maybe better, and sufficient, to carefully link our workflow with WLCG Ops

Wait for monitoring & SLDs before the next set of improvements
– Timeline?
– Particularly revitalise ticket handling and ensure the minimum is in place
Questions
1. Pushing for administrative things to be done?
– twiki review, backup tests etc.
2. LHCOPN representative?
– Maybe not responsible for Ops but more liaising as a single contact point
– Share and justify the workload
3. GGUS merging
– Opinion?