Building Internet-Scale Distributed Systems for Fun and Profit

Peter Pietzuch
[email protected]

Large-Scale Distributed Systems Group
http://platypus.doc.ic.ac.uk

Distributed Software Engineering (DSE) Section
Department of Computing
Imperial College London

Oxford University Computer Laboratory – Oxford – June 2009
Internet-Scale Distributed Systems

• Search engines (e.g. Google, Yahoo, ...)
– Global crawling, indexing and search
• Google: over 450,000 servers in at least 30 data centres world-wide (?)
• Content delivery networks (CDNs) (e.g. Akamai, Limelight, ...)
– Scalable web hosting, file distribution, media streaming, ...
• Akamai: hosting for Microsoft.com, CNN.com, BBC iPlayer, ...
• Social networking sites (e.g. Facebook, Twitter, LinkedIn, ...)
• Facebook: serves 200 million users and stores 40 billion photos
• Cloud computing applications (e.g. Amazon, Microsoft, Google, ...)
– Pay-as-you-use storage and computation for applications
• Amazon: bought servers worth $86 million in 2008 alone
Internet-Scale Distributed Systems

• Peer-to-peer computing (e.g. Bittorrent, BOINC, ...)
– Contribute users’ resources for file sharing, scientific computing
• Bittorrent: “1/3 of all Internet traffic” (?) [CacheLogic’04]
• @home computing: SETI@home, Quake-Catcher@home
• Large-scale test-beds (e.g. PlanetLab, Emulab, ...)
– Possible to deploy research systems in the real world
• PlanetLab: 1041 nodes at 500 sites (May’09)
– Great for student projects!
Properties of Internet-Scale Systems
• Large number of users, requests, resources, ...
– Single/multiple data centres, hosts and/or mobile clients
⇒ Requirement: Scalability
• Wide-area Internet communication
– Cannot ignore network effects
⇒ Requirement: Network-awareness
• Long-running, 24/7 service
– Must adapt to changing conditions and failure
⇒ Requirement: Fault-tolerance
Why is Building Internet-Scale Systems Hard?
• Scalability is hard to achieve
– How to organise 1000s of processing hosts?
– What is the programming model?
• Applications must be intelligent about network use
– How can we achieve application requirements?
• Continuous network and node failures
– Lead to data loss, resource shortages, inconsistency
• PlanetLab: 630 healthy machines out of 1041 total (May’09)
• Google: 1 failure per hour in 10,000-node clusters (source: Google)
High-level Abstractions Help
• Google uses several layers of abstraction
– Runs applications (search, mail, ...) on top of highest layer
– Each layer is scalable, network-aware and fault-tolerant
[Stack diagram: Google Apps run on top of MapReduce (computation) and BigTable (storage), coordinated by the Chubby lock service, all built on the Google File System]
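As a concrete illustration of what these layers buy, here is a minimal word-count sketch in the style of the MapReduce programming model (illustrative only, not Google’s actual API): users write just map() and reduce(), and the framework owns partitioning, distribution and fault handling. Here both phases simply run locally.

```python
from collections import defaultdict

def map_fn(doc):
    # Map: document -> list of (word, 1) pairs
    return [(word, 1) for word in doc.split()]

def reduce_fn(word, counts):
    # Reduce: (word, all counts for it) -> (word, total)
    return word, sum(counts)

def mapreduce(docs):
    groups = defaultdict(list)
    for doc in docs:                      # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)     # shuffle: group by key
    return [reduce_fn(k, v) for k, v in groups.items()]  # reduce phase

print(mapreduce(["big table big file", "big system"]))
# -> [('big', 3), ('table', 1), ('file', 1), ('system', 1)]
```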
Large-Scale Distributed Systems Group
• Research goal: “Support the design and engineering of scalable and robust Internet-wide applications”
• Need to provide higher-level abstractions at different layers
– Many success stories from research exist
• e.g. overlay networks, distributed hash tables, network coordinates, storage and replication mechanisms, ...
– Combination of networks, distributed systems & database research

[Layer diagram: data management layer on top of application layer on top of network layer]
Talk Structure
III. Data management layer:
Supporting imperfect data processing
DISSP Project: Dependable Internet-scale stream processing
II. Application layer:
Building adaptive overlay networks
LANC CDN Project: Network/load-aware content delivery
I. Network layer:
Improving Internet routing
Ukairo Project: Detour routing for applications
I. Improving Internet Routing
• Internet-scale applications want custom communication paths
– Skype wants path with low packet loss
– iTunes wants path with high download rate
• Internet uses two-level hierarchical routing scheme

[Diagram: hosts a and b connected across autonomous systems AS1–AS6]

– Internet hosts are part of autonomous systems (ASes)
• Inter-AS routing (BGP) and intra-AS routing (OSPF)
• Internet routing optimises for ISPs’ concerns!
– One path for all applications and no control over returned path
Taking Detours on the Internet
• Idea: Take multiple Internet paths and stitch them together
[Diagram: direct path from a to b across AS1–AS6 vs. a detour path relayed via node d]

– Resulting detour path may have better properties (see the sketch below)
• What causes Internet detour paths?
– Inter-AS routing not optimal + limited expressiveness
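To make the stitching concrete, here is a minimal sketch of one-hop detour selection (node names and latency values are invented; this is not the Ukairo implementation): a detour a → d → b wins when the stitched path beats the direct one on the metric the application cares about, here latency.

```python
latency = {            # measured round-trip times in ms (assumed values)
    ("a", "b"): 180,
    ("a", "d"): 40,
    ("d", "b"): 70,
}

def best_one_hop_detour(src, dst, relays):
    # Return (relay, cost) of the cheapest stitched path, or
    # (None, direct cost) if no detour beats the direct path.
    best, best_cost = None, latency[(src, dst)]
    for relay in relays:
        cost = latency[(src, relay)] + latency[(relay, dst)]
        if cost < best_cost:
            best, best_cost = relay, cost
    return best, best_cost

print(best_one_hop_detour("a", "b", ["d"]))  # -> ('d', 110): detour wins
```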
Finding Detours in the AS Graph [IPTPS’09]
• Idea: Analyse detours in the Internet AS graph
– Assume that similar AS-level paths benefit from similar detours
[Diagram: AS-level paths a→b and c→d share an AS link; a known good detour for one path suggests a potential good detour (via e) for the other]
– Perform clustering on similarity metric: shared link count (sketched below)
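A plausible form of this metric (an assumption on my part; the IPTPS’09 paper may define it differently) treats an AS-level path as a sequence of AS numbers and counts the links two paths share:

```python
def as_links(path):
    # AS-level links are consecutive pairs: [1, 3, 5] -> {(1, 3), (3, 5)}
    return set(zip(path, path[1:]))

def shared_link_count(path_a, path_b):
    # Similarity = number of AS links the two paths have in common
    return len(as_links(path_a) & as_links(path_b))

# Paths that overlap on the AS graph are candidates for the same detours,
# so cluster paths by this similarity (AS numbers are hypothetical):
p1 = [1, 3, 5, 7]   # AS path a -> b
p2 = [2, 3, 5, 6]   # AS path c -> d
print(shared_link_count(p1, p2))  # -> 1 (the shared link (3, 5))
```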
Ukairo Project: Detour Routing for Applications
• Deploying general-purpose detour routing plane on PlanetLab
– Continuously searches for Internet detour paths
– Nodes exchange found detours using gossiping
– Applications can use it transparently, e.g. web browser download
• Open research questions
– What is the overhead of finding detour paths?
– What happens if everybody uses detour routing?
– What do ISPs think about this?
– What are the lessons for future Internet designs?
Talk Structure
III. Data management layer:
Supporting imperfect data processing
DISSP Project: Dependable Internet-scale stream processing
II. Application layer:
Building adaptive overlay networks
LANC CDN Project: Network/load-aware delivery of content
I. Network layer:
Improving Internet routing
Ukairo Project: Detour routing for applications
II. Building Adaptive Overlay Networks
• Imagine your start-up idea of “mugbook” becomes an overnight success...
• How do you support such a website?
– Single web server?
– Multiple web servers in single data centre?
Content Delivery Networks
• Content delivery networks (CDNs) serve content to many clients world-wide
– Overlay network consists of:
• Distributed set of servers that maintain content replicas
• Clients (web browsers) that request content
Mapping Clients to Content Servers
• How do we assign clients to content servers?
– Load awareness
• Don’t direct clients to overloaded content servers
– Network awareness
• Don’t send traffic on congested network paths
• Many heuristics proposed in the past
– Geographic location
– Clustering of address prefixes
– Proprietary solutions
Cost Graph
• Associate each client/server pair with a cost
– Use download times from servers as cost metric
• Incorporates load and network congestion
• But: measurement overhead remains high
– Can’t measure all costs – need to estimate missing ones
Network Coordinates
• Idea: Assume cost graph embeddable in metric space
– Approximate missing measurements using Euclidean distances
• Assign each client/server a network coordinate C
– Distances between coordinates estimate download costs
‖ C(Client1) – C(Server1) ‖ ≈ download_time
Computing Network Coordinates
• Scalable, decentralised computation (e.g. using Vivaldi algorithm) [Dabek’04]
– 2-5 dimensions sufficient in practice
– Low measurement overhead
– Continuous process
[Figure: network coordinates computed for ~1500 web servers with network delay as cost]
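A minimal sketch of one Vivaldi-style update step (the damping constant and variable names are illustrative, not the paper’s tuning): each node nudges its coordinate after every latency sample, like a node in a relaxing spring network.

```python
import random

DELTA = 0.25  # damping factor (assumed value)

def vivaldi_update(my_coord, peer_coord, measured_rtt):
    # Distance currently predicted by the coordinates
    dist = sum((a - b) ** 2 for a, b in zip(my_coord, peer_coord)) ** 0.5
    error = measured_rtt - dist  # spring force: > 0 pushes nodes apart
    if dist > 0:
        direction = [(a - b) / dist for a, b in zip(my_coord, peer_coord)]
    else:  # co-located nodes pick a random direction
        direction = [random.uniform(-1, 1) for _ in my_coord]
    # Move proportionally to the error, damped by DELTA
    return [a + DELTA * error * d for a, d in zip(my_coord, direction)]

# One update in 2-d (2-5 dimensions suffice in practice):
print(vivaldi_update([0.0, 0.0], [3.0, 4.0], measured_rtt=10.0))
# -> [-0.75, -1.0]: moved away from the peer, since 10 > 5
```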
LANC Content-Delivery Network [ROADS’08]
• Use network coordinates to organise content servers and clients
– Clients keep track of content servers in “neighbourhood”
– Map clients to “nearest” content servers in space
• Overloaded content servers “move away” (see the sketch below)
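A sketch of what this could look like (assumed behaviour, not the actual LANC code): an overloaded server adds a penalty to its effective distance, so nearby clients drift to less loaded servers.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def pick_server(client_coord, servers):
    # servers: list of (name, coordinate, load_penalty); an overloaded
    # server's penalty makes it look farther away than it really is
    return min(servers,
               key=lambda s: euclidean(client_coord, s[1]) + s[2])[0]

servers = [
    ("near-busy", [1.0, 1.0], 50.0),  # close but overloaded
    ("far-idle",  [5.0, 5.0],  0.0),  # farther away but unloaded
]
print(pick_server([0.0, 0.0], servers))  # -> 'far-idle'
```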
Does it really work? (Yes!)
• Deployed LANC CDN on PlanetLab
• 119 content servers and 16 clients
• Downloaded Linux distribution from 100 web servers world-wide
• Tried several different assignment strategies
[Figure: CDF of transfer data rate per request (KB/s, log scale from 10 to 10000) for four assignment strategies: LANC CDN, Nearest, Random, Direct]
Talk Structure
III. Data management layer:
Supporting imperfect data processing
DISSP Project: Dependable Internet-scale stream processing
II. Application layer:
Building adaptive overlay networks
LANC CDN Project: Network/load-aware delivery of content
I. Network layer:
Improving Internet routing
Ukairo Project: Detour routing for applications
III. Supporting Imperfect Data Processing
• Global sensing infrastructures
[Diagram: a data collection, fusion, aggregation & dissemination layer connecting data sources (mobile sensing devices, traffic monitors, scientific instruments, RFID tags, cameras, body sensor networks, web feeds, embedded sensors, wireless sensor networks, web content) to users and applications]
– Runs continuous queries over sensor streams
– Failure takes out resources
Stream Data Model
• Data sources emit streams of data tuples
– Tuples follow a schema with fields, e.g. (ts, coord, image)
• Users submit declarative queries
– Range of operators (filter, join, transform, ...) process data tuples (see the sketch below)
[Diagram: query tree in which two coordinate transform operators feed an image merging operator]
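A toy sketch of this model (all names illustrative): tuples are records with (ts, coord, image) fields, and operators are stream transformers composed into a query tree like the one above.

```python
def coordinate_transform(stream):
    # Map each tuple's coordinate into a common frame
    # (a hypothetical fixed offset stands in for the real transform)
    for t in stream:
        x, y = t["coord"]
        yield {**t, "coord": (x + 1.0, y - 1.0)}

def image_merge(left, right):
    # Pair up tuples from two upstream operators by timestamp and
    # merge their images (merging itself is stubbed out)
    for a, b in zip(left, right):
        assert a["ts"] == b["ts"]
        yield {"ts": a["ts"], "coord": a["coord"],
               "image": a["image"] + "+" + b["image"]}

cam1 = ({"ts": i, "coord": (i, i), "image": f"c1_{i}"} for i in range(3))
cam2 = ({"ts": i, "coord": (i, i), "image": f"c2_{i}"} for i in range(3))
for out in image_merge(coordinate_transform(cam1), coordinate_transform(cam2)):
    print(out)
```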
Failure Recovery in Stream Processing
• Use redundant resources to achieve dependability
[Diagram: two replicas of the query tree, each with coordinate transform and image merging operators]
– Run multiple copies of same query operator
• But: Internet-scale system may not have enough spare resources
– Instead accept degradation in processing quality
• Idea: Enhance stream data model to include quality information
Quality-Centric Stream Data Model
• Enhance data tuples with:
[Diagram: source tuples D1–D6, each with weight 1 and recall 1, are combined into tuples D7 (weight 2, recall 0.75) and D8 (weight 3, recall 0.83)]

– Weight: number of data sources in tuple
– Recall: fraction of received tuples
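A sketch of how this metadata might propagate through an aggregation operator (the combination rule below is my assumption of plausible semantics, not necessarily DISSP’s exact definition):

```python
from dataclasses import dataclass

@dataclass
class QTuple:
    data: object
    weight: int     # number of data sources represented in this tuple
    recall: float   # fraction of expected input tuples that arrived

def merge(tuples, expected):
    # Aggregate input tuples; `expected` is how many source tuples
    # should have contributed, so missing inputs lower the recall.
    weight = sum(t.weight for t in tuples)
    recall = sum(t.weight * t.recall for t in tuples) / expected
    return QTuple([t.data for t in tuples], weight, recall)

# Three sources expected, one failed: the result carries weight 2 and
# recall 2/3, telling consumers how degraded it is.
out = merge([QTuple("d1", 1, 1.0), QTuple("d2", 1, 1.0)], expected=3)
print(out.weight, round(out.recall, 2))  # -> 2 0.67
```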
What is it Good for?
• Provide feedback about result quality to users
– Measure of how much data made it into the result tuple
• Allow system to handle node and network failures
1. Proactive operator replication
• Invest resources where failure impact is highest
2. Reactive failure recovery
• Decide based on lost recall whether recovery is worthwhile
• Support for smart load-shedding under resource shortage
– Discard tuples with lowest impact on overloaded processing nodes (sketched below)
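One way such a policy could look (the ranking rule is an assumption): rank queued tuples by weight × recall and drop the lowest-impact ones first.

```python
def shed(queue, capacity):
    # queue: list of (data, weight, recall); keep the `capacity`
    # highest-impact tuples, drop the rest
    ranked = sorted(queue, key=lambda t: t[1] * t[2], reverse=True)
    return ranked[:capacity], ranked[capacity:]

queue = [("a", 3, 0.9), ("b", 1, 0.5), ("c", 2, 1.0)]
kept, dropped = shed(queue, capacity=2)
print([t[0] for t in kept])     # -> ['a', 'c']
print([t[0] for t in dropped])  # -> ['b']
```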
DISSP Project:
Dependable Internet-Scale Stream Processing
• Currently building prototype system
– Anybody will be able to connect sensor sources + run queries
– System provides best-effort service given available resources
[Diagram: the same sensing infrastructure as on the earlier slide, connecting data sources to users and applications]

• Open questions
– What’s the right data model for processing sensor data?
– How to discover data sources in a scalable fashion?
– How to perform query optimisation at a global scale?
Research Outlook
• Programming model
– What are the right abstractions for building Internet-scale systems?
• Need richer Internet interface – not just send(packet,dest_IP)
– How do we build robust cloud applications?
• Currently too much focus on low-level services
• System management
– How do we provision Internet-scale systems?
• Scale up/down for sudden rise in popularity – “flash crowds”
• Testing and evaluation
– How do we test, debug and evaluate Internet-scale systems?
• Hard to obtain reproducible results from PlanetLab experiments
Conclusions
⇒ Internet-scale apps have new network requirements
– “One size doesn’t fit all” – but it’s hard to change the Internet
Ukairo: Overlay networks can provide custom routing
⇒ Internet-scale systems need new overlay abstractions
– Apply geometric algorithms to solve distributed systems problems
LANC CDN: Metric space for node organisation in CDN
⇒ Internet-scale systems require new data models
– Unrealistic to expect perfect processing
– Instead accept failure and overload as a fact of life
DISSP: Make impact of failure on processing explicit
Thank You! Any Questions?
Peter Pietzuch
<[email protected]>
http://platypus.doc.ic.ac.uk