
GN3(Plus)2 / GN4 New Idea Form
Advanced Virtual Networks (Continue SA2 and SA3)
In GN3Plus the SA2 “Testbeds as a Service” effort was tasked to provide network
research “testbeds” as a production service. This effort has resulted in a
generalized architecture that allows a wide range of network resources to be
integrated into a single virtualized distributed service environment. The initial use
case was to support network research, and so the Testbeds as a Service (TaaS)
offering provides Linux-based virtual machines, Ethernet-framed virtual circuits,
and virtualized hardware that supports Ethernet and/or OpenFlow switching as the
basic elements that can be incorporated into an experimental network environment.
The proposed SA2 Testbeds model depends upon dynamic virtual circuits to
provide dedicated network capacity and performance for the virtual circuit
resources that testbeds use to link the experimental network elements. The SA2
TaaS service expects an NSI-based SA3 Bandwidth on Demand service to be available.
The NSI-based BoD service is critical in that it allows the testbeds service to
extend virtual circuits well beyond the GN3/GN4 domain. This will allow an
Advanced Virtualized Network service to be replicated by other domains and to
create testbeds that span multiple domains.
The TaaS model is extensible. It defines a base set of primitives for manipulating
resources regardless of the resource type, it leverages OpenStack software for
interfacing with and managing virtual machine resources (making it “cloud
friendly”), and it leverages the NSI protocol for provisioning virtual circuits
(enabling global reach of data plane circuits). The service allows for recursive,
object-oriented construction of complex networks from simpler, more atomic
resources, which allows the architecture to address large-scale testbeds.
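The recursive construction described above can be sketched as follows. This is an illustration only: the class names and primitives are assumptions for the example, not the actual TaaS API.

```python
# Illustrative sketch of a recursive, object-oriented resource model.
# Class names and primitives are hypothetical, not the TaaS interface.

class Resource:
    """Every resource type shares the same base primitives."""
    def __init__(self, name):
        self.name = name
        self.state = "defined"

    def reserve(self):
        self.state = "reserved"

    def provision(self):
        self.state = "provisioned"


class VirtualMachine(Resource):
    pass  # e.g. an OpenStack-managed VM


class VirtualCircuit(Resource):
    pass  # e.g. an NSI-provisioned Ethernet circuit


class VirtualNetwork(Resource):
    """A network is itself a resource, so testbeds compose recursively."""
    def __init__(self, name, members):
        super().__init__(name)
        self.members = members

    def reserve(self):
        for m in self.members:
            m.reserve()       # recursion bottoms out at atomic resources
        super().reserve()

    def provision(self):
        for m in self.members:
            m.provision()
        super().provision()


# A two-site testbed built from atomic resources, then nested in a larger one.
site_a = VirtualNetwork("site-a", [VirtualMachine("vm1"), VirtualMachine("vm2")])
testbed = VirtualNetwork("testbed", [site_a, VirtualCircuit("a-to-b")])
testbed.reserve()
testbed.provision()
```

The point of the sketch is that a composite network responds to the same primitives as an atomic resource, which is what lets complex testbeds be built from simpler ones.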
These virtualized network environments can be used to address a much broader set
of network experimentation or network management needs than simply those of
network researchers. It is this broader set of use cases that we propose should be
addressed in GN4. Communities such as high energy physics, radio
astronomy, and bio-informatics are exploring
more effective means of network design to address the data distribution
requirements of their collaborations and science workflows. For instance, the use
of dynamic virtualized network services can be leveraged by the HEP community to
construct different versions of the LHCONE network topology to explore and
evaluate each version’s efficacy against the workflow requirements. Similarly, the
eVLBI community has resources that are distributed around the world and that have
real-time data transport requirements. The data transport topology requirement changes
substantially from one science experiment to another. Dynamic virtual networks
can construct application specific networks – including computational processing
and data aggregation/storage components – to meet these global requirements.
Other science communities are similarly organizing to develop global science
networks that are tailored to each community’s needs and preferences. There is no
technical difference between an application-specific virtual network established for
network research and one established to support other science research programs.
Other use cases include emerging global IT services such as high-quality interactive
video and real-time immersive environments. These services are scheduled and
expect high-quality data transport among multiple sites globally. Integral
components of such services include multipoint control units (MCUs),
streaming/capture elements, and transcription components. The geographic
placement of these processing and transport elements can have a significant
impact on the quality and human experience of a specific virtual meeting. Thus
the per-conference video streams will define specific topologies, and the
overall global video services network will require network performance guarantees
and processing and switching capabilities that will vary over time.
For GN4, we propose that the Testbeds as a Service evolve into a Virtualized
Networks Service. This service will provide enhanced capabilities such as:
a) On the order of 4-32 Bare metal virtual servers per pop (essentially
dedicated blade servers)
b) Small scale VM capability on the order of 10 to 100 VMs per pop (could be
provisioned on blade servers above)
c) Multi-domain integration with independently administered large scale
VM/cloud services (greater than 10^3 VMs) to provide substantial
[virtualized] computational/modeling facilities
d) Enhanced switching resources including virtual routers/switches (and
conventional routing protocols), OpenFlow hardware, optical or photonic
elements, etc.
e) High speed, distributed, large scale virtual storage – order 2-10 TB/VM,
and/or 10-100 TB/pop.
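As an illustration only, a user request that touches the resource classes above might be described as follows; every field name, pop code, and value here is invented for the example, since no service interface has been defined.

```python
# Hypothetical request description for the proposed Virtualized Networks
# Service.  Field names, pop codes, and values are invented illustrations.

request = {
    "name": "example-virtual-network",
    "resources": [
        {"type": "bare-metal", "pop": "POP-1", "count": 4},        # item (a)
        {"type": "vm", "pop": "POP-2", "count": 20, "vcpus": 2},   # item (b)
        {"type": "cloud-peering", "provider": "external-dc"},      # item (c)
        {"type": "switch", "pop": "POP-3", "mode": "openflow"},    # item (d)
        {"type": "storage", "pop": "POP-1", "size_tb": 10},        # item (e)
    ],
    "circuits": [
        # NSI-style virtual circuit linking two pops of the testbed
        {"a": "POP-1", "z": "POP-2", "mbps": 10000},
    ],
}

# A front-end could validate such a request before reservation, e.g.:
vm_total = sum(r.get("count", 0) for r in request["resources"]
               if r["type"] == "vm")
```

The resource list maps one-to-one onto the capability classes (a) through (e) above; the circuits list corresponds to the NSI-provisioned data plane links.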
A key element of the GN4 Virtualized Networks Service would be the multi-domain
capability. The service must be capable of interacting with other service domains
that deploy the same service (such as the GEANT NRENs, or organizations with VM
data centers that wish to participate). This will allow users to construct networked
environments that span multiple administrative domains.
The cost of this GN4 service will be directly related to the amount of infrastructure
deployed (the blade servers, the number of pops deployed, the amortized cost of
virtual routers, switching devices, etc.). A hardware estimate would be 100K
EUR/pop times ~25 pops = 2.5M EUR in the first year, and a similar cost for
hardware refresh in Year 4. Annual maintenance is estimated at 10%, or
250K EUR/year. The service will require development to code the interface
primitives for each new species of virtual resource introduced, and to enhance the
existing functionality such as dynamic IPv{4,6} network addressing/allocation,
enhanced virtual network description grammar and user interface, optimized
dynamic resource/instance mapping and optimization, freeze/thaw and migration
handling, error handling/notification, security and performance monitoring, AAI, etc.
The software development is estimated at approximately 4-6 programmer FTE per
year, plus 4-6 hardware and systems engineering FTEs per year, plus 4 service
management personnel for a total of 12-16 FTEs per year for each of the first four
years. A reduced staff of 8-10 personnel is possible in later years assuming the
service technology has matured and fewer novel capabilities are being implemented.
Thus the personnel cost would amount to approximately 2.8M EUR/year through
year 4, and approximately 1.8M EUR/year in the out years. With an average FTE cost
of 150K EUR, the total estimate over the seven-year lifetime of GN4 is ~20M EUR.
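As a rough sanity check, the arithmetic behind the total estimate can be sketched from the figures quoted above (unit costs and FTE ranges as stated in the text; everything else follows from them):

```python
# Rough sanity check of the cost estimate, using only figures stated above.

fte_cost = 150_000                 # EUR per FTE per year
hardware = 100_000 * 25            # ~25 pops at 100K EUR each, year 1
refresh = hardware                 # similar refresh cost in year 4
maintenance = 0.10 * hardware * 7  # 10% of hardware per year, seven years

# Personnel: 12-16 FTE in years 1-4, 8-10 FTE in years 5-7
personnel_low = 12 * fte_cost * 4 + 8 * fte_cost * 3
personnel_high = 16 * fte_cost * 4 + 10 * fte_cost * 3

total_low = hardware + refresh + maintenance + personnel_low
total_high = hardware + refresh + maintenance + personnel_high
print(total_low, total_high)   # 17550000.0 20850000.0
```

The resulting range of roughly 17.6M to 20.9M EUR is consistent with the ~20M EUR seven-year figure quoted above.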
Performance Verification
The emergence of “performance guaranteed” services - such as the GEANT
Bandwidth-on-Demand service and similar services being deployed around the
world – is a response to the increasing pressure by global collaborative [science]
communities for network services that are agile and offer predictable and reliable
performance guarantees. These services are not simply a means of acquiring
transport capacity for FTP, but provide a means of planning and coordinating
information sharing requirements among distributed science centers within a
global affinity group (a community of collaborators such as the physics community
or the bioinformatics community), as well as of ensuring these activities can be
reliably allocated across many such communities and applications. These services are a
foundation for application specific networks (such as LHCONE or eVLBI),
experimental testbeds (SDN, OpenCalls, etc), advanced services (such as global
integrated video services networks), and even conventional multi-layer network
substrates. Because these services dedicate real network assets to a
specific user, and because users will make large-scale investments based upon these
services being reliable and predictable, there will need to be a means of verifying
the performance of these services. Performance verification is a means for the
provider to validate their provisioning and for the user to ensure their service
instance(s) are functioning properly. And with an appropriate verification
architecture, it can also be a means to localize performance problems and initiate
corrective actions.
The performance verification process must define a means for invoking the process,
specifying the service guarantees to be verified, performing an appropriate test
sequence and results analysis, and notifying users of faults. As the services under
scrutiny are globally provisioned over multiple independent domains, the
verification process must also be global – i.e. it must be scalable, secure, and
privacy-preserving, and it must respect network administrative autonomy end to
end. It must be
deterministic and detailed in terms of measurement and analysis in order to
accurately identify subtle failure modes. It must be automated in order to provide
rapid detection and to enable automated corrective processes. (This proposal does
not address mitigation and recovery – simply failure detection, localization, and
notification.)
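A minimal sketch of such a verification cycle might look as follows. All names and measurements here are hypothetical, since no PV protocol yet exists; a real agent would run active or passive tests in place of the placeholder.

```python
# Illustrative skeleton of a PV cycle: specify guarantees, measure,
# analyse, and report violations.  Names and data are hypothetical.

from dataclasses import dataclass

@dataclass
class Guarantee:
    metric: str            # e.g. "bandwidth_mbps", "loss_pct"
    bound: float
    higher_is_better: bool

def measure(metric):
    # Placeholder: a real PV agent would run an active/passive test here.
    samples = {"bandwidth_mbps": 9800.0, "loss_pct": 0.2}
    return samples[metric]

def verify(guarantees):
    """Return the violated guarantees; an empty list means the service
    instance meets its stated guarantees."""
    faults = []
    for g in guarantees:
        value = measure(g.metric)
        ok = value >= g.bound if g.higher_is_better else value <= g.bound
        if not ok:
            faults.append((g.metric, value))
    return faults

sla = [
    Guarantee("bandwidth_mbps", 10000.0, True),  # violated: 9800 < 10000
    Guarantee("loss_pct", 0.5, False),           # met: 0.2 <= 0.5
]
print(verify(sla))   # [('bandwidth_mbps', 9800.0)]
```

The fault list produced by such a cycle is what would feed the notification step, and later the automated corrective processes mentioned above.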
This NIF is divided into two phases:
The first phase will develop an initial PV Architecture document and a proof-of-concept implementation. Phase 1 will develop the basic PV protocol agent(s) and
tools that leverage the emerging NSI framework for inter-domain connection
provisioning. This PoC will show how detailed deterministic automated fault
analysis can be performed end-to-end for fault localization.
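The localization idea can be illustrated with a deliberately simplified sketch: an end-to-end circuit crosses several domains, and testing each segment against the guarantee isolates the fault. Domain names and loss figures below are invented for the example.

```python
# Trivially simplified fault localization: measure each inter-domain
# segment of an end-to-end circuit and flag the ones violating the SLA.
# Segment names and loss values are fabricated for illustration.

SLA_MAX_LOSS_PCT = 0.1

# Measured packet loss (percent) per segment of one end-to-end circuit
segments = {"domain-A": 0.01, "domain-B": 0.02,
            "domain-C": 0.85, "domain-D": 0.03}

def localize(measurements, threshold):
    """Return the segments whose measurements violate the threshold."""
    return [seg for seg, loss in measurements.items() if loss > threshold]

print(localize(segments, SLA_MAX_LOSS_PCT))   # ['domain-C']
```

In practice the hard parts are obtaining per-segment measurements across administrative boundaries without violating domain autonomy, which is exactly what the NSI-aware PV agents above are for.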
The second phase is intended to enhance the PV capabilities. Advanced timing
mechanisms will be incorporated including high speed micro-programmable NIC
hardware for active packet scheduling and passive time stamping at link speed in
the 10 and 100 Gbps regimes. Phase 2 will also enhance the intelligent test/analysis cycle
to resolve and identify the failure mode(s) (not just the failure location.) Phase 2
will also extend the PV services to include greater participation both within GEANT
footprint and globally within the GLIF and NSI community. Phase 2 is also intended
to incorporate highly accurate clocking for jitter characterization. A potential
integration of the PV effort with the Time and Frequency Distribution work could
result in both extremely accurate performance timing studies as well as novel end-to-end latency measurement capabilities.
Note: This Performance Verification NIF is not intended to ignore perfSONAR. But
we believe it is imperative that we pursue PV from a clean slate perspective. It is
more important that we devise a scalable PV architecture that is applicable to
emerging and anticipated service environments than to worry (at this stage) about
its potential backward compatibility with legacy tools or practices. Where the
clean slate approach overlaps existing best practice or infrastructure, we can
explore integration with that infrastructure. But we want to first demonstrate the
core functionality of deterministic automated PV within the context of future
services, and then try to understand if or how prior monitoring tools might be
applicable (or reusable).
Developing and refining an advanced Performance Verification network
architecture and service will require:
Phase 1. 1 year:
1. Two full-time systems software developers to work out the PV protocol,
and then develop the protocol agents, test tools, and basic problem
resolution analysis workflow processes. The Phase 1 proof of concept
should also develop basic packet scheduling and passive measurement
tools with simple timing mechanisms (high-accuracy timing will come in
Phase 2).
2. One FTE hardware engineer to develop the line interface hardware for
active/passive testing, stream capture, etc.
3. A systems architect at 50% FTE will be required to shepherd the PV draft
architecture document and manage the phase 1 development.
Phase 2. 3-6 years
1. Two FTE system software developers to continue the high level software
development and to develop the low level high accuracy timing features
required for traffic shaping and passive measurement at 100 Gbps.
2. 1 FTE hardware engineer to develop the high speed timing hardware.
This person may be different from Phase 1 in that this phase is focused on
extremely high speed time stamps rather than line interfacing.
3. 50% FTE for management. The effort in Phase 2 will be to refine the
architecture, to develop a consensus standard document, and to support
multi-domain expansion of the PV capabilities. This stage can also look at
integration with other monitoring services to provide a comprehensive
automated process.
Distributed High Resolution Timing Service
There is an increasingly important need for highly accurate time to be available to
networks and researchers. Examples of such need include global scale
interferometry (eVLBI, SKA, etc) where scientific sensors record high frequency
time domain signals at multiple points around the world. These traces must have
extremely accurate timing available at the capture stage so that the data can be time
stamped for subsequent correlation between sensor traces. In advanced networks,
accurate time is required for synchronous communication protocols and is essential
for detailed and deterministic study of network traffic behavior and jitter
characteristics in the 100 Gbps regime. In many cases, coordinated timing is also
required to enable the real-time determination of properties such as the latency
between two locations. The science use cases are generally much more sensitive and
require extremely accurate time.
There are a number of European projects exploring the use of a portion of the
spectrum on conventional fiber optic networks to deliver the signals necessary to
distribute very accurate timing among geographically distributed sites. In
particular, the GEANT Open Call “International Clock
Comparison via Optical Fiber” (ICOF) is attempting a proof of concept test between
London and Paris to do this timing distribution.
This NIF proposes that this technology be seriously considered for GN4. If possible,
GN4 should allocate a portion of the fiber optic spectrum for the delivery of such
optical signals. It should be noted that the current technology will require special
engineering to allow the bidirectional propagation of signals – which will have
implications for the in-line amplification and regeneration engineering along the
fiber path.
It is presumed that timing re-distribution hardware and algorithms will be
necessary in each pop to distribute time throughout the GEANT footprint (not just
point to point) and to provide that accurate time as a service for upper layer
network and science applications.
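As background, the basic mechanism for comparing two clocks over a network is two-way time transfer, the same exchange used by NTP. A simplified sketch, with fabricated timestamps and assuming a symmetric path:

```python
# Two-way time transfer, simplified.  The four timestamps are:
#   t1: request sent (local clock),  t2: request received (remote clock),
#   t3: reply sent (remote clock),   t4: reply received (local clock).
# Assumes a symmetric path; any asymmetry shows up directly as offset error,
# which is why fiber-based distribution needs bidirectional propagation
# over the same path.

def offset_and_delay(t1, t2, t3, t4):
    offset = ((t2 - t1) + (t3 - t4)) / 2   # remote clock minus local clock
    delay = (t4 - t1) - (t3 - t2)          # round-trip path delay
    return offset, delay

# Fabricated example: remote clock 5 units ahead, 4 units one-way delay.
off, rtt = offset_and_delay(t1=0, t2=9, t3=10, t4=9)
print(off, rtt)   # 5.0 8
```

Optical time and frequency distribution aims at vastly better accuracy than this packet-level exchange can deliver, but the symmetric-path assumption above is the same reason the in-line amplification and regeneration engineering mentioned earlier matters.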
This NIF proposes an applied research project with the express objective of
delivering extremely accurate and coordinated time at each GN4 pop. This effort
should specify targets for maximum accuracy, smallest resolution, and any other
relevant measure. Resolution of 10^-15 seconds may be possible (a femtosecond, a
million times smaller than a nanosecond), and even greater accuracy may be achievable.
The GN4 project should pose a service architecture for time and frequency
distribution that could scale to the entire GN4 footprint.
Related effort would explore how a Highly Accurate Time Service (HATS?) could
potentially be peered with other HATS environments (perhaps similar services
deployed by the NRENs.) The project should explore algorithms or processes for
multi-domain service peering in order to coordinate time between multiple
independent timing services. Such inter-domain time distribution will require
security to protect against bad clocks being introduced. The project should look at
both a model for service peering and the issues that must be addressed.
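One simple defence, shown here as an illustrative sketch only, is to take a robust statistic such as the median over the offsets reported by peer timing services, so that a single faulty or malicious clock cannot drag the consensus. A production design would need something stronger (e.g. Marzullo-style interval intersection, as NTP uses) plus authentication of peers; the peer names and offsets below are invented.

```python
# Robust consensus over peer clock offsets: the median bounds the
# influence of any single bad clock.  Peer names/values are fabricated.

import statistics

peer_offsets_ns = {
    "peer-A": 12.0,
    "peer-B": 15.0,
    "peer-C": 13.0,
    "peer-D": 5_000_000.0,   # a bad clock, wildly off
}

consensus_ns = statistics.median(peer_offsets_ns.values())
print(consensus_ns)   # 14.0 -- the faulty peer barely moves the result
```

With a simple mean instead of the median, the bad clock above would pull the consensus off by over a millisecond, which illustrates why the peering model must be designed for robustness and not just averaging.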
A related research objective would explore and define technical requirements for
ultra-longhaul time and frequency distribution targeting submarine cables. Such
trans-oceanic timing, when combined with continental timing nets, would
potentially allow for a global timing network to evolve.
A highly accurate time service would be a major (disruptive!) new capability for
research in Europe. To date, such accurate and coordinated time is not available as
a service in any conventional optical networks and is not presently being explored
by any other regional or international networks. If successful, this project could put
GEANT in a leadership role in developing this powerful new technology/service for
both science and network engineering.