
Architectural and Service Considerations for a
Distributed Transatlantic Exchange Point
for LHC Tier 2 traffic[a]
W. E. Johnston
Introduction
It is clear from the Bos-Fisk report on Tier 2 requirements that “unstructured” T2 traffic will be
the norm in the future and is already showing up in significant ways on the Research &
Education network infrastructure. The R&E network provider community must gain control of
this traffic both to ensure good service for the T2s and to ensure fair-sharing of the general
infrastructure with other uses.
At the same time it must be recognized that the T2 and T3 sites will have widely varying needs
and capabilities: Some will not use anything but their regular routed IP connections for the
foreseeable future and others would happily use a circuit[b]-based infrastructure if it were available
and affordable.
This paper is intended to explore the operational and architectural issues of some of the possible
solutions.
The proposed solutions are both variations of a distributed exchange point with “aggregation”
nodes strategically placed around Europe and the US. An aggregation node would be a switch-router that could:
1) Provide IP access for the LHC community by advertising low-cost routes to common data sources (T1s and T2s), and
2) Act as a circuit ingress point that speaks the DICE IDC[c] protocol for circuit setup.
These nodes would most likely be co-located with existing R&E exchange points such as MAN
LAN, Starlight, MAX, etc., in the US, and in Europe at some of the GÉANT routing nodes and some of the independent exchange points like NetherLight.
Underlying the switch-router nodes would be either an engineered lightpath[d] infrastructure dedicated to this purpose or a federated collection of independent nodes operated by various R&E networks. Both cases would allow for traffic engineering that would manage traffic on the exchange point's internal lightpaths (single domain or federated domains) in order to optimize the use of the transatlantic capacity.
Both models are considered here; however, it may be that the "federated collection of community-owned lightpaths" approach is an interim step while political support and funding are organized for the purpose-built approach.
[a] The ideas in this white paper have evolved considerably in the course of discussions with the ESnet Engineering Group, especially Joe Metzger and Kevin Oberman; however, any technical inconsistencies are the fault of the author.
[b] We will use the term "circuit" (or virtual circuit) to mean a user-initiated, end-to-end path that provides bandwidth guarantees and perhaps other guarantees such as reliability.
[c] See http://www.controlplane.net/
[d] "Lightpath" is not a well-defined term. It was originally used to refer to an optical end-to-end path, as in "Applications Drive Secure Lightpath Creation across Heterogeneous Domains," Gommans, L.; Dijkstra, F.; de Laat, C.; Taal, A.; Wan, A.; Lavian, T.; Monga, I.; Travostino, F., IEEE Communications Magazine, March 2006, and "Semantics for Hybrid Networks Using the Network Description Language," Jeroen van der Ham, Paola Grosso, and Cees de Laat, OGF Document Series, GFD.165, 2010-03-08, Informational Document from the Infrastructure Area and NML-WG group. ("Dedicated optical circuits (lightpaths) can run over wavelengths (lambdas) available in the optical part of a hybrid network. Lightpaths are assigned to applications or users for their data traffic, creating a lambda network, where the lightpaths cannot interfere with each other.") However, we will use the more current (GLIF) generalized definition that a lightpath is a direct connection between two (not necessarily adjacent) nodes in the network without the need for routers. Lightpaths are characterized by guaranteed capacity, quality, and reliability: for example an appropriate Ethernet VLAN or SONET circuit, in addition to lambdas / waves.
Characteristics of transatlantic networking
There are characteristics unique to submarine cable systems that need to be addressed. Apart
from the expected occasional long outages due to marine cable damage, experience has shown
there are numerous short (a few minutes to 20-30 minutes) circuit interruptions. See Figure 1.
Figure 1. Cable outage durations on several transatlantic marine cables over a period of several months. The lower graph is expanded along the event-length axis to better show short outages. Note that the vertical grid lines are spaced one day apart. Data from USLHCNet, ESnet, and GÉANT.
For circuit-based applications, or for IP clouds for which there is only one physical path connecting two routers (e.g., on opposite sides of the Atlantic), this sort of behavior could
have distinctly negative impacts.
The service offering
There are at least three network services that must be supported by the exchange point: 1) best-effort IP transit in the context of a consistently managed BGP cloud that would provide access to the US LHC data-related sites from Europe and vice versa; 2) user-managed virtual circuits ("circuits") with bandwidth guarantees, traffic isolation, and perhaps protection; and 3) standardized cross-domain monitoring and testing (i.e., perfSONAR).
Given the behavior of the transatlantic paths described above, consideration also needs to be
given to addressing this issue.
A common way for data centers and large user sites to use circuits is as VPNs that interconnect
routers at the end sites that advertise limited or private address spaces. In this case if there is
more than one VC interconnecting the routers, and those VCs are on diverse paths, then in the
event of a path failure the user site BGP can reroute the traffic to a circuit on a different path.
There are other options here as well, including aggregated Ethernet (or SONET) links, and using BFD or LACP (or LCAS and VCAT) to deal with individual circuit outages.
Where the application is moving data in long-lived serial streams, a fast fail-over protect mechanism may be needed in order to prevent significant disruption. For this purpose a mechanism that is below the circuit transport layer is needed so that the fail-over does not disrupt the virtual circuit (that is, the circuit definition and integrity remain intact).
Even in the case of highly parallel data movers, the absence of a second path for a redundant circuit or for BGP to use would mean that a path failure resets all TCP sessions of the parallel data transfer, undoubtedly triggering a heavy-weight response at the application layer. A protect mechanism, or redundant lightpaths managed within the exchange point domain by, e.g., OSPF, would therefore need to be provided if reliability is part of the circuit service offering.
There are fast re-route mechanisms in both SONET and MPLS that can be used to provide a
protect mode. In both cases the devices at each end of the VC that manage the fail-over must be
compatible and must operate in a control domain where both ends of the path can be configured
consistently. In other words, this is typically done in a single network domain where all of the
equipment involved is compatible and configured by a single network operator, though some
generalization to multi-domain environments may be possible.
The nature of these paths (e.g. co-linear or diverse) would be determined by the technology providing protect mode and the physical configuration of the lightpaths. In Figure 2 the protect paths are shown as co-linear pairs, but most modern technology does not require this.
Whatever fast reroute / protect mechanism is used, it should provide for alternative uses of the
protect lightpath when these lightpaths are not in use for fail-over. That is, the fail-over
mechanism should be able to preempt other traffic on the lightpaths that protect circuits. Policy-wise this means that it must be possible to identify and manage multi-priority traffic.
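To make the policy concrete, the following is a minimal sketch (in Python, with hypothetical class and flow names, not any particular deployed system) of how a protect lightpath might carry preemptible lower-priority traffic that is displaced when a fail-over occurs:

    # Minimal sketch (hypothetical, not an existing system) of multi-priority
    # handling on a protect lightpath: low-priority traffic may use the spare
    # capacity until a fail-over preempts it.
    from dataclasses import dataclass, field

    @dataclass
    class Flow:
        name: str
        priority: int          # higher number = higher priority
        bandwidth_gbps: float

    @dataclass
    class ProtectLightpath:
        capacity_gbps: float
        flows: list = field(default_factory=list)

        def admit(self, flow: Flow) -> bool:
            """Admit opportunistic traffic if spare capacity exists."""
            used = sum(f.bandwidth_gbps for f in self.flows)
            if used + flow.bandwidth_gbps <= self.capacity_gbps:
                self.flows.append(flow)
                return True
            return False

        def activate_protection(self, protected_bw_gbps: float, min_priority: int):
            """On fail-over, preempt lower-priority flows until the protected
            circuit's bandwidth fits on this lightpath."""
            preempted = []
            # Drop the lowest-priority flows first.
            for f in sorted(self.flows, key=lambda x: x.priority):
                spare = self.capacity_gbps - sum(x.bandwidth_gbps for x in self.flows)
                if spare >= protected_bw_gbps:
                    break
                if f.priority < min_priority:
                    self.flows.remove(f)
                    preempted.append(f)
            return preempted

    # Example: a 10 Gbps protect path carrying 8 Gbps of best-effort traffic
    # must yield when a 6 Gbps protected circuit fails over onto it.
    path = ProtectLightpath(capacity_gbps=10)
    path.admit(Flow("best-effort-A", priority=1, bandwidth_gbps=5))
    path.admit(Flow("best-effort-B", priority=1, bandwidth_gbps=3))
    print([f.name for f in path.activate_protection(6, min_priority=5)])

The point is simply that the protect capacity remains usable in normal operation, provided the traffic on it can be identified by priority and preempted quickly when a fail-over occurs.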
Figure 2. A minimal exchange point architecture that provides protection for single-lightpath-based circuit connections between users and redundancy for exchange point load balancing.
A purpose-built infrastructure approach
With purpose-built infrastructure it is assumed that the infrastructure will be operated as a single
network domain.
As an IP service, the distributed exchange point would be set up as an independent network
domain. The BGP cloud (announcing LHC-interest routes) would be implemented on a
collection of routers most likely located at existing major R&E exchange points in the U.S. and
in Europe. These routers would connect to a collection of transatlantic cable lightpaths and
terrestrial, inter-node lightpaths for redundancy and load-balancing. Connections to the exchange
point routers would be available at the exchange point points of presence (PoPs) in a fashion
similar to how the single-location exchange points operate today. This would provide a robust, high-capacity IP transit network across the Atlantic (Figure 2).
As a virtual circuit service, connection points for each request would be determined by the
exchange point IDC[e] domain circuit manager. Circuits are expected to carry most of the traffic as
the Tier 2 sites obtain connections to the closest exchange point node. As such, provision must
be made for reliability of these circuits. This can be done in one of two ways.
Most of the existing use of circuits by the LHC community is essentially as VPNs. That is, the
circuits interconnect site routers at each end of the circuit. The site routers make BGP
announcements across the circuit to a site at the far end. If there are redundant circuits on
independent lightpaths, then the BGP sessions at the ends will decide how those are used, but the
possibility of a user-managed fail-over strategy is clear. A concrete example of this is illustrated
in Figure 3. In this case independent loads are offered to the two primary paths by two site routers which share a third path for backup. There is then no compelling need for a protect-mode circuit because the multiple circuits between the two sets of site routers are managed by BGP.
In the case of a single connection between the two sites, whether it is being used as a VPN or as an Ethernet circuit, a protect path becomes important, as noted above. It may be that a more efficient use of transatlantic bandwidth could be obtained by having the exchange point operator provide and manage a protect circuit rather than having sites reserve two circuits and manage the protection as redundant circuits with BGP at the user sites. In this way the protect circuit might
be used for other purposes when not serving as a protect path.
[e] The DANTE, Internet2, CANARIE, and ESnet ("DICE") group has defined an inter-domain protocol (IDCP) that domain controllers can use with other controllers to communicate circuit needs and then establish the circuit end-to-end. The domain controllers are called IDCs. See http://www.controlplane.net/.
Figure 3. Example of the use of multiple circuits to provide a redundant lightpath infrastructure for user BGP management. (The ESnet domain circuits are managed by OSCARS – ESnet's IDC.)
In the design of the exchange point two load management issues need to be taken into account:
One is to manage the offered load so that the input to any given node does not become
congested, and the other is to manage the egress load from each node so that the inter-node paths
do not become congested.
The input load could be managed in several ways.
In the case of non-circuit IP traffic, since the same routes will be advertised at all nodes in Europe for the US sites and vice versa in the US, MEDs (multi-exit discriminators) can be used at each node to indicate to its peers which is the optimal entry point into the exchange point network for particular routes. The exchange point operator would dynamically adjust the MEDs to optimize the lightpath loading between the nodes (within the exchange point) and thus prevent congestion on internal paths. For this to work two conditions must be met. One is that the sites offering load must take into account the preferred route information being provided by the MEDs. The other is that there must be enough available path diversity in the terrestrial lightpaths between the sites and the exchange point nodes so that a given traffic source would have paths to the preferred node for the route (as specified by the MEDs), which might not be the "closest" exchange point node.
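As a simple illustration of the MED mechanism (hypothetical node names, prefix, and MED values), a peer network that honors MEDs selects its entry point by preferring, among otherwise-equal advertisements of the same route from the exchange point nodes, the one with the lowest MED:

    # Minimal sketch of MED-based entry-point selection by a peer network.
    # Node names, prefix, and MED values are hypothetical.

    def best_entry_node(advertisements):
        """Pick the advertisement with the lowest MED, assuming all other
        BGP tie-breakers (local-pref, AS-path length, etc.) are equal."""
        return min(advertisements, key=lambda adv: adv["med"])

    # The same US route advertised at three European exchange point nodes,
    # with MEDs set by the operator to reflect current transatlantic
    # lightpath loading.
    ads = [
        {"prefix": "192.0.2.0/24", "node": "Amsterdam", "med": 50},
        {"prefix": "192.0.2.0/24", "node": "Geneva",    "med": 20},
        {"prefix": "192.0.2.0/24", "node": "London",    "med": 80},
    ]
    print(best_entry_node(ads)["node"])   # -> "Geneva" (least-loaded entry)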
In the case of circuit-based traffic the IDCs involved in the end-to-end circuit will exchange topology information indicating available ingress / egress points between domains. (The distributed exchange point is a single circuit domain.) When a circuit is requested, each IDC involved in the end-to-end path will identify the current optimal path internal to its domain and the corresponding domain ingress / egress points (nodes). These interconnect points will be used to set up the circuit with adjacent domains. This will automatically manage path loads internal to each domain, since any path identified as suitable for the circuit will not have conflicting uses: the user circuits have guaranteed bandwidth.
Once again, for this to be a successful strategy there must be a diversity of paths available
between the sites and/or adjacent domains and several of the exchange point nodes.
The lightpath infrastructure underlying the routers of the exchange point would need to have
several characteristics.
In order to provide the most straightforward load balancing of non-circuit IP traffic there needs to be some "lateral" capacity (terrestrial lightpaths) interconnecting the routers on each side of
the Atlantic in addition to the transatlantic capacity. Given this redundancy, load balancing
internally could be accomplished by an intra-domain routing protocol such as OSPF.
In the case of circuits the load balancing would be controlled by an IDC that has a
comprehensive view of the topology and current use of the exchange point. In addition to this
there must be some internal paths that are capable of providing protected paths in order to
support reliability as a circuit service characteristic.
A federated collection of community-owned lightpaths approach
In the federation model each network operator maintains control over its resources but all
operators cooperate to provide a relatively homogeneous-appearing exchange point.
To sketch out nothing more than a feasibility argument for the federated exchange, consider the
following.
General IP service:
A uniform BGP cloud could be simulated by each operator advertising the same set of routes on
both sides of the Atlantic. Since the operators have different points of presence locations (e.g.
New York, Washington, and Miami) the issue is to develop a mechanism that would load the available transatlantic lightpaths of the federated domains more-or-less uniformly. One way to do this[f] is to use a dynamic MEDs approach.
Unlike a single domain, where MEDs can be set using complete knowledge of local conditions, the individual domains of a federation do not have the global knowledge needed to use MEDs to load balance across multiple domains.
A coordinated use of MEDs – based on global information about the load state of each individual
domain – by multiple networks providing transatlantic service and all advertising the same
routes, could effectively distribute loads evenly across multiple paths.
The global view needed to dynamically reflect transatlantic lightpath loading can be provided by
having the individual networks of the federated exchange point sample their transatlantic
lightpath loadings periodically and then report those loads to a central agent. The central agent
would then compute new MEDs based on the current loads of all of the federated paths with the
goal of directing traffic to more lightly loaded paths. The new MEDs would be distributed back
to the networks of the federation to install and would favor routes on less loaded transatlantic
paths. The traffic flows in the U.S. or Europe networks would then use the terrestrial R&E
infrastructure to gain access to the transatlantic paths of federated networks whose transatlantic
routes were the least loaded, thereby uniformly distributing the load across the available
transatlantic paths. The path loads would be reported periodically (e.g. hourly) and the MEDs
adjusted accordingly, thus steering future connections to different federation PoPs.
[f] This idea is due to Kevin Oberman of ESnet.
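A minimal sketch of this central-agent idea, with invented network names, load figures, and MED range: each member reports the load on its transatlantic lightpaths, and the agent maps lower loads to lower (more attractive) MEDs for that member's advertisements of the common route set.

    # Sketch (hypothetical values) of a central agent that turns reported
    # transatlantic lightpath loads into MED values for each federation member.
    # Lower MED = more preferred, so lightly loaded paths attract new traffic.

    def compute_meds(reported_loads, med_min=10, med_max=100):
        """reported_loads: {network_name: fraction of committed capacity in use}.
        Returns {network_name: MED to advertise with the common route set}."""
        meds = {}
        for net, load in reported_loads.items():
            load = min(max(load, 0.0), 1.0)              # clamp to [0, 1]
            meds[net] = round(med_min + load * (med_max - med_min))
        return meds

    # Hourly reports from three federation members (fraction of the capacity
    # they have committed to the federation, not of the raw lightpath).
    reports = {"NetworkA": 0.30, "NetworkB": 0.80, "NetworkC": 0.60}
    print(compute_meds(reports))
    # -> {'NetworkA': 37, 'NetworkB': 82, 'NetworkC': 64}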
This approach does not supersede the operational requirements of the individual operators of the
federated infrastructure because the operators have complete freedom to report path loads in any
way that they wish. For example, if they only want to use 50% of their transatlantic capacity to
support the federation, then they can just report the “loading” based on utilization of 50% of the
lightpath bandwidth rather than 100% of the bandwidth. In this way, e.g., 5 Gbps of traffic on a
10 Gbps lightpath is 100% load with respect to the federation use of their capacity.
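The arithmetic of this reporting policy is straightforward; a small sketch (hypothetical numbers) reproduces the example above:

    # Sketch of load reporting against committed (not raw) capacity.
    def reported_load(traffic_gbps, lightpath_gbps, committed_fraction):
        """Load as seen by the federation: traffic relative to the share of the
        lightpath the operator has committed to federation use."""
        return traffic_gbps / (lightpath_gbps * committed_fraction)

    # 5 Gbps of traffic on a 10 Gbps lightpath, half of which is committed,
    # is reported as 100% loaded from the federation's point of view.
    print(reported_load(5.0, 10.0, 0.5))   # -> 1.0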
Virtual circuit service
Providing a virtual circuit in a federated environment is more complicated, but uses the same
general idea of an external agent providing the global knowledge needed to load balance across
independent (but federated for this purpose) domains.
A federation IDC (inter-domain circuit manager) could operate as an über-controller providing a
single contact point for transatlantic circuits and managing a global view of the use of the
available capacity.
The federated networks would export that portion of their topology on which they permit VC
service, together with the constraints (e.g. the maximum bandwidth allowed for circuits on each
lightpath). The federation IDC would use a constrained shortest path routing algorithm to define
a path through the distributed exchange point. The topology exported by the federated networks
would be relatively static since it would likely be governed by policy (e.g. what lightpaths are
available for use by VCs and with what bandwidth) in each federation member, rather than by
usage, which is overseen by the federation IDC. That is, the dynamism – the lightpath usage churn as VCs are set up and torn down – would be managed by the federation IDC as the über-controller.
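The following is a sketch of the kind of constrained path computation the federation IDC could perform over the exported topology; the node names, link capacities, and request size are invented for illustration, and hop count stands in for whatever path metric the IDC actually uses.

    # Sketch of constrained shortest-path selection over the topology exported
    # by the federation members. Links carry the bandwidth still available for
    # circuits; links that cannot carry the request are pruned before routing.
    from collections import deque

    # (node_a, node_b, available_circuit_gbps) -- hypothetical exported topology.
    links = [
        ("NYC", "AMS", 4), ("NYC", "GVA", 8), ("WDC", "AMS", 6),
        ("NYC", "WDC", 10), ("AMS", "GVA", 10),   # terrestrial "laterals"
    ]

    def find_path(src, dst, gbps):
        """Fewest-hop path using only links with enough spare circuit capacity."""
        adj = {}
        for a, b, cap in links:
            if cap >= gbps:                       # constraint: prune tight links
                adj.setdefault(a, []).append(b)
                adj.setdefault(b, []).append(a)
        queue, seen = deque([[src]]), {src}
        while queue:
            path = queue.popleft()
            if path[-1] == dst:
                return path
            for nxt in adj.get(path[-1], []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None                               # no feasible path for this request

    # An 8 Gbps request cannot use NYC-AMS (only 4 Gbps spare), so the IDC
    # routes it NYC -> GVA -> AMS over the European lateral instead.
    print(find_path("NYC", "AMS", 8))   # -> ['NYC', 'GVA', 'AMS']

In practice the exported topology and the per-lightpath circuit bandwidth limits would come from each federation member's policy, as described above.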
Due to potential user constraints, this approach gives up more flexibility, relative to the corresponding non-federated version of the service, than the BGP/MEDs approach does.
For example, the user attempting to set up the circuit may not have the flexibility to make use of
the exchange point ingress or egress location provided by the über-IDC because of constraints in
how the user can route their part of the circuit in order to get to an arbitrary exchange point PoP.
That is, if all transatlantic circuit capacity from a particular PoP is consumed then the user may
not be able to construct a circuit to a different PoP where there is capacity, and where the
distributed exchange point routing algorithm has defined the ingress and egress points for the
circuit, because the user may not be able to get circuit capacity to the alternative PoP.
This issue might be addressed by having the federation operators provide terrestrial capacity
between the exchange point PoPs in the U.S. and similarly in Europe (the “laterals” described
above) so that the transatlantic path could effectively be back-hauled on terrestrial circuits to
where the user is able to connect. However, this approach of providing terrestrial capacity
between the exchange point PoPs could have its own set of political issues separate from
providing transatlantic bandwidth.
Nevertheless, in principle, the tools to support this approach to federated transatlantic circuit
management exist, and so a starting point exists.
Monitoring and testing
Providing a federated monitoring and testing infrastructure should not present any problems. The
federated use of perfSONAR is well established and already deployed in many of the networks
that might participate in a distributed exchange point federation.
Governance
The governance mechanism might involve two sets of stakeholders: the WLCG, representing the experiments overall, and the funders. This needs to be worked out with the LHC community. In the case of a federated infrastructure, the infrastructure management would be by a committee of the provider representatives on the one hand and the WLCG and/or LHC sites on the other.
Financial Support
For the purpose-built case, an approach might be taken similar to what some of the R&E
exchanges in the US have done: initial "central" funding for constructing the exchange point, followed by user subscriptions to support operations.
Conclusions
High performance, high-capacity transatlantic networking is critical for modern science. The
current transatlantic infrastructure is too fragmented to provide the needed capacity and
reliability. The model of the R&E open exchange points can be applied to aggregate existing capacity into a coherent infrastructure that can provide the required capacity and services for data-intensive science. The current R&E providers of transatlantic capacity have the circuit capacity
and diversity, and the technical capability to build and operate a potentially highly effective
distributed transatlantic exchange point if a suitable operational and capability model can be
agreed upon.