EGI-InSPIRE
EGI Network Support Task Force
Mario Reale IGI / GARR
[email protected]
January 24, 2011
EGI OMB f2f meeting
Amsterdam EGI.eu
EGI-InSPIRE RI-261323
1
www.egi.eu
Overview
• Introduction to the Task Force
• Definition of the identified use cases
• Answers from the NGIs
Goals and duration
• Mandate: assess the current state of network support for EGI
and formulate a proposal for it
– Gather user requirements from NGIs
– Assess the status of the available tools
– Further develop and consolidate new proposed tools
– Identify missing bits / tools
– Propose tools and workflows to the EGI Net Sup community
– Define draft workplan for the next months
• Started on October 20, 2010, ended on January 21, 2011
– around 8 working weeks in duration
– coordinated remotely
• met 5 times by videoconference: 20/10, 10/11, 22/11, 10/12, 14/1
Membership
• Etienne Duble France-Grille (UREC CNRS)
• Xavier Jeannin France-Grille (UREC CNRS)
• Esther Robles (RedIRIS)
• Alberto Escolano (RedIRIS)
• Bruno Hoeft (D-GRID KIT)
• Mario Reale (IGI GARR)
• Fulvio Galeazzi (IGI GARR)
• Alfredo Pagano (IGI GARR)
• Wenshui Chen (ASGC)
• Domenico Vicinanza (DANTE Int.Rel.Team)
• Szymon Trocha (PSNC/GN3 SA2 T3 PerfSONAR)
What has been done
• Identified 7 network-related use cases
• Organized a questionnaire about them for the NGIs,
then gathered and published the results
• Identified a strategy for each of them
– although the strategies are specified at differing levels of
detail and technical depth
• Some of us worked on further development of tools
– PerfSONAR live-CD, HINTS, NetJobs
• Designed the GGUS network support workflow to be
implemented for EGI
• Liaised with GN3 about the current PerfSONAR
status/tools
What has NOT been done
• Brought all proposed new tools to a final, frozen
production status after an extensive validation phase
– However, all proposed tools are already usable by early
adopters
• Made a world-wide, general assessment of all
available tools for network monitoring and network
support in general
• Developed new tools in every case where we felt that either a
brand-new tool or a major improvement of the existing
ones was required
– Example: Network-related Scheduled Maintenances
• Identified Use Cases (7)
• Answers from the NGI Questionnaire
GGUS
• Grid Users and Site Administrators open a ticket in
the GGUS support system when they think a
network issue is behind the problems they are
experiencing. Tickets are assigned to the GGUS
Network Support Unit and processed until solved.
• We need to give a home to all network-related
issues in EGI, which are currently unattended
• To whom should network-related issues be assigned?
– A support team made up of network experts from
volunteering NGIs or NRENs?
– Skip the Grid community and assign tickets directly to the
NRENs and/or GEANT/DANTE?
• Many parties involved in ticket processing: Site
Admins, NREN NOCs and APMs, GEANT NOC and
APMs
Answers on GGUS
[Charts: "GGUS: answer from each NGI", showing the provided answer type (n.) for NGIs 1 to 22]
Answer n.3:
Having a GGUS support unit for Network Support is useful,
but tickets should be handled automatically according to a given workflow
and routed to NREN/NGI contacts; there is no need for a permanent
team behind this unit
EGI PERT
• Grid Users experiencing poor performance in data transfers
can refer to a global EGI PERT Contact Team (with both Grid
Middleware/Applications and Network know-how) to get
support
• The idea would be to have a single EGI-wide team of experts
with both Grid Middleware/Applications and Network know-how (merging the 2 communities)
• Expensive idea, but useful:
– bottleneck identification involves digging into both domains and their
interface/interaction
– Middleware and Application experts (VO,VRCs) could start excluding
higher level issues in the ISO/OSI stack before NRENs and Federated
EduPERT networking experts come in
• It turned out to be too expensive for the NGIs’
manpower/budget – at least at this stage
Provided Answers on
EGI PERT
[Charts: "PERT: answer from each NGI", showing the provided answer type (n.) for NGIs 1 to 22]
Answer n.4:
Having a Global EGI PERT access point for users experiencing
poor performances – forming a PERT Team with Grid-added know how –
is useful, but we cannot commit any resource/manpower to it
Scheduled Maintenances
• When an identified incident or the scheduled maintenance of
network devices/PoPs affects a Grid resource center/site,
users, site admins and Operations teams are warned in
advance (scheduled maintenance) or informed as soon as possible (incident)
• The idea is to tell users/site admins why things are not
working when there is a known reason for the problems they
are experiencing. Currently GOCDB is used for Grid-related
scheduled maintenance.
• Requires NREN-NGI communication/coordination:
– a mapping between Network devices/PoPs and Grid resource
centers/sites
– a mapping between Grid resource centers/sites and Users
• Can be managed using push or pull logic
– Push: users subscribe to a given site and get notified
– Pull: impacted sites publish information on a web site and users fetch
information from there
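The two mappings and the two notification styles listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PoP, site and user names are invented, and the function names are hypothetical, not part of GOCDB or of any tool proposed by the task force.

```python
from collections import defaultdict

# Hypothetical mappings; every PoP, site and user name below is invented.
POP_TO_SITES = {
    "pop-a": ["SITE-1", "SITE-2"],
    "pop-b": ["SITE-3"],
}
SITE_TO_USERS = {
    "SITE-1": ["alice", "bob"],
    "SITE-2": ["bob"],
    "SITE-3": ["carol"],
}


def impacted_sites(pop):
    """Return the sites whose connectivity depends on the given PoP."""
    return list(POP_TO_SITES.get(pop, []))


def push_notifications(pop, message):
    """Push logic: build the messages to send to each subscribed user."""
    outbox = defaultdict(list)
    for site in impacted_sites(pop):
        for user in SITE_TO_USERS.get(site, []):
            outbox[user].append(f"{site}: {message}")
    return dict(outbox)


def pull_board(pop, message):
    """Pull logic: one entry per impacted site, published for users to fetch."""
    return {site: message for site in impacted_sites(pop)}
```

With push logic the PoP-to-site and site-to-user mappings are both needed on the notifying side; with pull logic only the PoP-to-site mapping is needed, since users look up their own sites.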
Provided Answers on
Scheduled Maintenances
[Charts: "Scheduled Maintenances: Answer from each NGI", showing the provided answer type (n.) for NGIs 1 to 22]
• Answer n.3:
Having a global EGI tool/service to warn users and site administrators
about Sched Maint is useful; storing the information in one place is the
solution to go for, but we cannot commit any manpower/resource to
develop nor maintain such a tool
Network Troubleshooting on Demand
• Grid site administrators, Operation Centers or
authorized users experiencing problems in reaching a
given site/resource perform troubleshooting on
demand to exclude basic network issues behind the
problems they’re experiencing
• Requires local deployment at the sites of probes
controlled by a central system
• Results in the introduction of different roles
• Basic checks would involve ping, traceroute, reverse
DNS checks, port scan, available bandwidth
measurements
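Two of the basic checks listed above (reverse DNS and a single-port reachability test) can be sketched with the Python standard library alone. This is a minimal sketch, not the probe software discussed by the task force: the function names are hypothetical, and ping, traceroute and bandwidth measurements are left out because they need raw sockets or external tools.

```python
import socket


def reverse_dns(ip):
    """Return the PTR name for an IP address, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def basic_checks(host, ports, timeout=3.0):
    """Run the socket-level checks and return a plain dict report."""
    try:
        ip = socket.gethostbyname(host)
    except OSError:
        # Forward resolution failed, so nothing else can be checked.
        return {"host": host, "resolved": None, "reverse_dns": None, "ports": {}}
    return {
        "host": host,
        "resolved": ip,
        "reverse_dns": reverse_dns(ip),
        "ports": {p: tcp_port_open(ip, p, timeout) for p in ports},
    }
```

A probe deployed at a site could run something like `basic_checks(target, ports)` on request from the central system and return the resulting dict; the transport and the authorization of the different roles are outside this sketch.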
Provided answers on
Network Troubleshooting on Demand
[Charts: "Troubleshooting On Demand: Answer from each NGI", showing the provided answer type (n.) for NGIs 1 to 22]
• Answer n.3:
Having a network tool for troubleshooting on Demand is
useful, but we cannot commit any resource/manpower to
contribute to develop nor test it
e2e MultiDomain monitoring
• Users and Site Administrators get network
performance measurements for a subset of e2e
paths within the EGI Infrastructure, based on
monitoring information gathered by scheduled,
periodic measurements
• Multidomain: NRENs, GEANT
• Monitoring data may include
– Link Availability (i/f utilization, Input Errors, Output
Drops)
– One-way Delay
– RTT, number of hops
– IPDV (Jitter)
– Available TCP Bandwidth
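As an illustration of how two of the listed metrics relate, the sketch below derives IPDV from a series of one-way delay samples, taking IPDV as the difference between the one-way delays of consecutive packets. The function names and the sample values are assumptions made for the example; they do not come from PerfSONAR or any other tool named in these slides.

```python
from statistics import mean


def ipdv(one_way_delays_ms):
    """Delay variation between consecutive packets: d[i+1] - d[i]."""
    return [b - a for a, b in zip(one_way_delays_ms, one_way_delays_ms[1:])]


def summarize(one_way_delays_ms):
    """Summarize one measurement run on a single e2e path."""
    variations = ipdv(one_way_delays_ms)
    return {
        "min_delay_ms": min(one_way_delays_ms),
        "mean_delay_ms": mean(one_way_delays_ms),
        "max_delay_ms": max(one_way_delays_ms),
        # Mean absolute variation is one common way to report jitter.
        "mean_abs_ipdv_ms": mean(abs(v) for v in variations) if variations else 0.0,
    }


# Invented sample: four one-way delay measurements in milliseconds.
samples = [10, 12, 11, 15]
report = summarize(samples)
```

A scheduled measurement would produce one such report per path and per run, which is what makes comparison over time possible.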
Provided answers on e2e
multidomain monitoring
[Charts: "e2e MultiD sched mon: Answer from each NGI", showing the provided answer type (n.) for NGIs 1 to 24]
Answer n.3:
Having an e2e MultiDomain monitoring tool for a specific
subset of the whole set of e2e paths within EGI
is useful, but we cannot commit resources nor manpower
and cannot afford deploying anything locally at the sites
DownCollector
• Users, Site Admins and Operation Centers need to
check if services available at various grid sites are
reachable and responsive
• DownCollector was developed during EGEE for
monitoring Grid services at the sites
• Migrated from EGEE ENOC to EGI
• Checks that services are reachable on specific ports
from a central location (star-based architecture)
• A possible evolution would be to deploy additional,
geographically distributed instances and gather their
results
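The possible evolution towards several distributed instances raises the question of how their results would be combined. The sketch below is one hypothetical way to do it, not DownCollector's actual logic: a service seen from every instance is "up", from none is "down", and from only some is "partial", which hints at a network problem between a vantage point and the site. Instance names, site names and port numbers are invented.

```python
def aggregate(results_by_instance):
    """Merge per-instance reachability results into one status per service.

    results_by_instance maps an instance name to a dict of
    {(site, port): reachable} as seen from that instance.
    """
    seen = {}
    for results in results_by_instance.values():
        for service, reachable in results.items():
            seen.setdefault(service, []).append(reachable)

    status = {}
    for service, outcomes in seen.items():
        if all(outcomes):
            status[service] = "up"
        elif not any(outcomes):
            status[service] = "down"
        else:
            status[service] = "partial"
    return status


# Invented example: one central and one remote instance.
example = {
    "central": {("SITE-1", 2811): True, ("SITE-2", 8443): False, ("SITE-3", 2170): True},
    "remote-1": {("SITE-1", 2811): True, ("SITE-2", 8443): False, ("SITE-3", 2170): False},
}
statuses = aggregate(example)
```

With a single central instance the "partial" state cannot occur, which is the main information a star-based architecture cannot provide.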
Provided answers on
DownCollector
[Charts: "DownCollector: Answer from each NGI", showing the provided answer type (n.) for NGIs 1 to 24]
Answer n.3:
Having a DownCollector tool is useful but we
cannot commit any manpower nor resources to
contribute to its deployment
Policy & Collaboration
• Establish an EGI group of people, a body
permanently in charge of interfacing with the NRENs,
EGI.eu, EMI, DANTE, GEANT and TERENA to
discuss issues related to
– the provisioning of network connectivity or the upgrade of
existing links,
– new services and new standards,
– new tools for monitoring,
– new joint initiatives on tutorials and dissemination of tools,
– testing and prototyping of middleware with respect to the
network layer
so that the requirements coming from the EGI user
community and the VRCs can be conveyed to the
network community and relevant information is
exchanged
Provided Answers on
Policy & Cooperation
[Charts: "Policy and Cooperation: Answer from each NGI", showing the provided answer type (n.) for NGIs 1 to 24]
Answer n.2:
Having a Policy and Cooperation Group
is useless.
How we structured today’s meeting
1. Introduction to the TF and its objectives
2. Report on what we propose for each use case
3. Presentation of tools
4. General Discussion/Feedback from NGIs
– We should decide on
• Approving a GGUS workflow
– so that it can be implemented within the GGUS system
• Adopting or dropping the proposed tools
• Identifying volunteering NGIs for early adoption and initial extended
deployment of tools
• Identifying possible missing bits or uncovered use cases/unsatisfied
requirements to work on