Keynote: Networking Challenges for the Next Decade

Amin Vahdat
On behalf of Google Technical Infrastructure and Google Cloud Platform
APRIL 4, 2017
Google Network
More than a collection of data centers
FASTER (US, JP, TW) 2016
SJC (JP, HK, SG) 2013
Unity (US, JP) 2010
Network fiber
Points of presence >100
Google Global Cache edge nodes
Google Cloud Regions
Adding 11 new regions
[Map: current and future regions, each marked with its number of zones. Regions labeled: Oregon, California, Iowa, S Carolina, N Virginia, Montreal, São Paulo, London, Netherlands, Belgium, Frankfurt, Finland, Mumbai, Singapore, Taiwan, Tokyo, Sydney]
Ubiquitous Cloud...10x Scaling
• Datacenter (10x): next-gen disaggregation of storage, memory and compute
• Campus & Metro (10x): cloud regions and campus expansion driving DC interconnect
• WAN (10x): cloud replication and bandwidth-intensive cloud services (e.g., turnkey video, IoT)
Step Function Disruptions: Bandwidth, Latency, Availability, Predictability
The Pillars of SDN @ Google
• B4: WAN Interconnect
• Andromeda: NFV and network virtualization
• Jupiter: Datacenter Networking
• Espresso: SDN for public Internet
B4: Google's Software Defined WAN
B4: [Jain et al., SIGCOMM 13]
BwE: [Kumar et al., SIGCOMM 15]
B4: From Copy Network to Business Critical
[Chart: B4 traffic, 2012 to 2016]
Andromeda
[Diagram labels: Google Infrastructure Services; VNET: 10.1.1/24; VNET: 192.168.32/24; VNET: 5.4/16; Load Balancing, DoS, ACLs, VPN, NFV; Internal Network with four ToR switches serving 10.1.1/24, 10.1.2/24, 10.1.3/24, 10.1.4/24]
Google Datacenter Network Innovation
And hardware scale that we could not buy
[Chart: capacity over time. Fabric generations labeled: 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter]
1.3 Pb/s clusters in 2013
The Pillars of SDN @ Google
• B4: WAN Interconnect
• Andromeda: NFV and network virtualization
• Jupiter: Datacenter Networking
• Public Internet? Espresso: SDN for public Internet
Espresso in Context
[Diagram labels: User, Internet, Espresso (Peering Metro), B2, B4, Jupiter Data Center, Google]
Espresso: Before and After
Before: Router Centric, Cloud 1.0
• Protocols
• Local view
• Connectivity first
• Coarse fault recovery
After: Espresso, SDN Peering
• Per-metro and global view
• Application signals
• Real-time optimization
Espresso Architecture Overview
[Diagram labels: Global Controller; Application Signals; Espresso Metro; Local Control; Peering Fabric (BGP speaker, Label-switched Fabric); eBGP Peering to External Peer; Hosts, each with a Packet Processor]
Labeled packets specify egress
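The slide states that labeled packets specify egress: hosts run a packet processor, and the label-switched peering fabric forwards on the label. A minimal Python sketch of that idea follows, assuming a longest-prefix-match table on the host. The class names, the `program` call (a stand-in for local or global control), the label values, and the port names are all hypothetical and are not taken from the talk.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPacket:
    label: int      # hypothetical label naming an egress peering port
    payload: bytes


class HostPacketProcessor:
    """Hypothetical host-side table mapping destination prefixes to labels."""

    def __init__(self) -> None:
        self._routes: dict = {}

    def program(self, prefix: str, label: int) -> None:
        # Stand-in for a controller pushing an egress decision to the host.
        self._routes[ipaddress.ip_network(prefix)] = label

    def send(self, dst: str, payload: bytes) -> LabeledPacket:
        addr = ipaddress.ip_address(dst)
        matches = [net for net in self._routes if addr in net]
        if not matches:
            raise LookupError(f"no egress programmed for {dst}")
        best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
        return LabeledPacket(self._routes[best], payload)


def fabric_forward(pkt: LabeledPacket, label_to_port: dict) -> str:
    # The fabric looks only at the label, not at the IP destination.
    return label_to_port[pkt.label]


host = HostPacketProcessor()
host.program("203.0.113.0/24", label=7)
host.program("203.0.113.128/25", label=9)
pkt = host.send("203.0.113.200", b"hello")
port = fabric_forward(pkt, {7: "peer-a", 9: "peer-b"})
```

In this sketch the egress choice lives entirely in the host table, so changing which peer carries a prefix means reprogramming hosts rather than reconfiguring the fabric.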
Next Decade Challenges in Networking
The next wave in computing
• Serverless compute in Cloud 3.0
• IoT
• Tightly coupled, general purpose distributed computing
It’s time to put it all together
• Agile Scale
• Jitter
• Isolation
• Performance is great, but only meaningful with availability, manageability, and velocity
Last Decade
Cloud 1.0: Virtualization delivers capex savings to enterprise DCs
Now
Cloud 2.0: HW on Demand
Public cloud frees enterprise from private HW infrastructure
Scheduling, load balancing primitives, “big data” query processing
The Third Wave of Cloud Computing
Cloud 3.0: Compute, not servers
Serverless compute, real-time intelligence, and machine learning
Not data placement, load balancing, OS configuration and patching
Networking should be aiming for Cloud 3.0
Networking and Cloud 3.0
• Storage disaggregation: the datacenter is the storage appliance
• Seamless telemetry and scale up/down
• Transparent live migration
• Open Marketplace of services, securely placed and accessed
Networking and Cloud 3.0
• Applications+Functions, not VMs
• Policy, not middleboxes
• Actionable Intelligence, not data processing
• SLOs, not placement/load balancing/scheduling
Next Decade Challenges in Networking
• The network will enable next-generation compute infrastructure
• The network can define next-generation storage infrastructure
• The right network infrastructure can deliver fundamental new capability
How we Prioritize Infrastructure Work
Performance
Stranding
Velocity
Manageability
Availability
Availability is Paramount
• First things first: an insecure infrastructure is an unavailable infrastructure
• Stability is more important than efficiency
• Network management is critical
• Configuration is hard
• Automation matters but can be counter to availability
“Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure.” SIGCOMM 2016.
Build for Velocity
• Velocity is the speed of iteration
• Retrospective on “Tussle in Cyberspace: Defining Tomorrow’s Internet”
• Build for hitless upgrades and self-validation
• Debugging and tracing matter
○ Without visibility, performance does not matter
• Network fabrics built for expansion and evolution
• Launch and Iterate
Isolation is Critical; Stranding is Terrible
Isolation with reservations is easy but leads to huge resource stranding
● General-purpose, shared infrastructure to approximate custom-built and reserved
Isolation has many components
● Latency, bandwidth, but also the control plane
● Accounting and chargeback are big missing pieces
Congestion Control is still really hard
● Rationalizing multiple control loops: flow, endpoint, flow group, Traffic Engineering
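The stranding claim on this slide can be made concrete with a small calculation. The Python sketch below compares per-tenant reservations (each tenant sized for its own peak) with one shared pool (sized for the peak of the aggregate). The tenant names, the demand numbers, and the `stranded_fraction` helper are illustrative assumptions, not data from the talk.

```python
def stranded_fraction(capacity, usage_over_time):
    """Fraction of provisioned capacity that sits idle on average."""
    mean_used = sum(usage_over_time) / len(usage_over_time)
    return 1 - mean_used / capacity


# Hypothetical bandwidth demand per tenant over four time slots.
demand = {
    "a": [8, 2, 2, 2],
    "b": [2, 8, 2, 2],
    "c": [2, 2, 8, 2],
}

# Total demand in each time slot, summed across tenants.
aggregate = [sum(slot) for slot in zip(*demand.values())]

# Reservations: every tenant gets capacity equal to its own peak.
reserved_capacity = sum(max(series) for series in demand.values())

# Shared pool: provision once for the peak of the aggregate.
shared_capacity = max(aggregate)

reserved_stranding = stranded_fraction(reserved_capacity, aggregate)
shared_stranding = stranded_fraction(shared_capacity, aggregate)
```

With these made-up numbers the peaks do not coincide, so reservations need twice the capacity of the shared pool and leave most of it idle. The shared pool only gets the same isolation if latency, bandwidth, and control-plane interference are controlled some other way, which is the hard part the slide points to.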
Performance only Matters if End to End
Amdahl’s law applies, so an incredible, localized optimization that takes any effort to adopt will be ignored
1. Scale
2. Jitter
3. Storage Disaggregation
Must optimize from the application all the way to the end user
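The Amdahl's law point can be checked with a few lines of arithmetic. The sketch below uses the standard formula, overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of end-to-end time spent in the optimized component and s is the local speedup. The 5% and 100x figures are illustrative assumptions, not numbers from the talk.

```python
def amdahl_speedup(fraction, local_speedup):
    """End-to-end speedup when `fraction` of the time is sped up by `local_speedup`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# A 100x improvement to a component that is 5% of end-to-end time.
overall = amdahl_speedup(0.05, 100)

# The ceiling for that component, however fast it becomes, is 1 / (1 - 0.05).
ceiling = 1.0 / (1.0 - 0.05)
```

Here `overall` comes out at roughly 1.05, close to the ceiling: a 100x local win yields only about a 5% end-to-end gain, which is why such an optimization is ignored if it costs effort to adopt.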
Next Decade Challenges in Networking
The next wave of computing
• Serverless compute in Cloud 3.0
• IoT
• Tightly coupled, general purpose distributed computing
It’s time to put it all together
• Agile Scale
• Jitter
• Isolation
• Performance is great, but only meaningful with availability, manageability, and velocity
Thank You!
Open Source
• Google MapReduce
• Google Bigtable
• Google Borg
• Google Dremel
Google Cloud Platform
Open Source
• QUIC
• TCP BBR
• gRPC
• OpenConfig
• ...
Google Cloud Platform