Berkeley NOW

NOW and Beyond
Workshop on Clusters and Computational Grids for
Scientific Computing
David E. Culler
Computer Science Division
Univ. of California, Berkeley
http://now.cs.berkeley.edu/
NOW Project Goals
• Make a fundamental change in how we design
and construct large-scale systems
– market reality:
» 50%/year performance growth => cannot allow 1-2 year
engineering lag
– technological opportunity:
» single-chip “Killer Switch” => fast, scalable
communication
• Highly integrated building-wide system
• Explore novel system design concepts in this
new “cluster” paradigm
Berkeley NOW
• 100 Sun UltraSparcs
– 200 disks
• Myrinet SAN
– 160 MB/s
• Fast comm.
– AM, MPI, ...
• Ether/ATM switched external net
• Global OS
• Self Config
Landmarks
• Top 500 Linpack Performance List
• MPI, NPB performance on par with MPPs
• RSA 40-bit Key challenge
• World Leading External Sort
• Inktomi search engine
• NPACI resource site
[Chart: MinuteSort results, gigabytes sorted vs. number of processors, with the SGI Origin and SGI Power Challenge shown for comparison.]
Taking Stock
• Surprising successes
– virtual networks
– implicit co-scheduling
– reactive IO
– service-based applications
– automatic network mapping
• Surprising unsuccesses
– global system layer
– xFS file system
• New directions for Millennium
– Paranoid construction
– Computational Economy
– Smart Clients
Fast Communication
[Chart: LogP communication parameters (latency L, send/receive overheads Os/Or, gap g) in µs, comparing the NOW Ultra and NOW SS10 clusters with the Paragon and Meiko.]
• Fast communication on clusters is obtained
through direct access to the network, as on MPPs
• The challenge is to make this general purpose
– the system implementation should not dictate how it can be used
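As a rough illustration of how such user-level communication gets measured, the sketch below times a small-message ping-pong to estimate LogP-style costs; comm_send/comm_recv are assumed stand-ins for the cluster's fast communication layer (e.g., Active Messages), not an actual NOW API.

```c
/* Ping-pong microbenchmark to estimate LogP-style costs at user level.
 * comm_send()/comm_recv() are assumed stand-ins, not an actual NOW API. */
#include <stdio.h>
#include <sys/time.h>

extern void comm_send(int dest, const void *buf, int len);  /* assumed */
extern void comm_recv(int src, void *buf, int len);         /* assumed */

static double usec_now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

void pingpong(int peer, int iters)
{
    char msg[8] = {0};
    double t0 = usec_now();
    for (int i = 0; i < iters; i++) {
        comm_send(peer, msg, sizeof msg);
        comm_recv(peer, msg, sizeof msg);
    }
    double rtt = (usec_now() - t0) / iters;
    /* For small messages, half the round trip approximates L + Os + Or. */
    printf("avg RTT %.2f us, est. one-way %.2f us\n", rtt, rtt / 2.0);
}
```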
Virtual Networks
• Endpoint abstracts the notion of “attached to the
network”
• Virtual network is a collection of endpoints that
can name each other.
• Many processes on a node can each have many endpoints, each with its own protection domain.
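A minimal sketch of the endpoint and virtual-network abstraction described above; the types, fields, and functions are illustrative assumptions, not the NOW data structures.

```c
/* Sketch of the endpoint / virtual-network abstraction; names and fields are
 * illustrative assumptions, not the NOW implementation. */
typedef struct {
    unsigned vnet_id;     /* which virtual network the endpoint belongs to */
    unsigned ep_index;    /* the endpoint's name within that virtual network */
} vnet_addr_t;

typedef struct {
    vnet_addr_t self;       /* this endpoint's name */
    void       *tx_queue;   /* send queue mapped into the owning process */
    void       *rx_queue;   /* receive queue mapped into the owning process */
    unsigned    prot_key;   /* per-endpoint protection domain tag */
} endpoint_t;

/* A process may create many endpoints, each in its own protection domain;
 * an endpoint can only name peers within the same virtual network. */
endpoint_t *ep_create(unsigned vnet_id);
int         ep_send(endpoint_t *ep, vnet_addr_t dest, const void *msg, int len);
```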
How are they managed?
• How do you get direct hardware access for
performance with a large space of logical
resources?
• Just like virtual memory
– active portion of large logical space is bound to physical
resources
[Diagram: processes 1..n on the host processor, each with endpoints in host memory; the active endpoints are bound to memory on the network interface (NIC), analogous to virtual-memory paging.]
Network Interface Support
• NIC has endpoint frames
• Services active endpoints
• Signals misses to driver
– using a system endpoint
[Diagram: NIC endpoint frames (Frame 0 through Frame 7) with transmit and receive queues; an endpoint miss is signaled to the driver.]
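The sketch below illustrates the virtual-memory analogy as the NIC might apply it: a small table of endpoint frames is searched on each send, and an unbound endpoint raises a miss for the driver. All names are assumptions for illustration.

```c
/* Virtual-memory-style binding of endpoints to NIC frames: active endpoints
 * occupy a small on-NIC frame table; a send to an unbound endpoint raises a
 * miss that the host driver services, much like a page fault.
 * All names are illustrative assumptions. */
#define NIC_FRAMES 8

typedef struct {
    int      valid;
    unsigned vnet_id, ep_index;    /* which logical endpoint is bound here */
} nic_frame_t;

static nic_frame_t frame_table[NIC_FRAMES];

extern void signal_endpoint_miss(unsigned vnet_id, unsigned ep_index); /* assumed: via system endpoint */
extern void transmit_from_frame(int frame, const void *msg, int len);  /* assumed */

static int lookup_frame(unsigned vnet_id, unsigned ep_index)
{
    for (int f = 0; f < NIC_FRAMES; f++)
        if (frame_table[f].valid &&
            frame_table[f].vnet_id == vnet_id &&
            frame_table[f].ep_index == ep_index)
            return f;
    return -1;                      /* endpoint miss */
}

void nic_handle_send(unsigned vnet_id, unsigned ep_index, const void *msg, int len)
{
    int f = lookup_frame(vnet_id, ep_index);
    if (f < 0) {
        /* Not resident: ask the driver to evict a cold frame and bind this
         * endpoint, then the send can be retried. */
        signal_endpoint_miss(vnet_id, ep_index);
        return;
    }
    transmit_from_frame(f, msg, len);
}
```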
Communication under Load
[Chart: aggregate messages/s (up to ~80,000) vs. number of virtual networks (1 to 28), for clients sending bursts of 1,024 to 16,384 messages and continuous traffic to a set of servers.]
=> Use of networking resources adapts to demand.
Implicit Coscheduling
[Diagram: application processes (A) coordinated by a gang scheduler (GS) on every node vs. by independent local schedulers (LS).]
• Problem: parallel programs designed to run in
parallel => huge slowdowns with local scheduling
– gang scheduling is rigid, fault prone, and complex
• Coordinate schedulers implicitly using the
communication in the program
– very easy to build, robust to component failures
– inherently “service on-demand”, scalable
– Local service component can evolve.
Why it works
• Infer non-local state from local observations
• React to maintain coordination
observation         implication              action
fast response       partner scheduled        spin
delayed response    partner not scheduled    block
[Timeline: Jobs A and B across workstations WS 1-4; a process spins while its request gets a fast response from a scheduled partner, and sleeps/blocks when the response is delayed.]
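A minimal sketch of the spin-then-block rule the table captures; reply_arrived, block_until_reply, and the spin threshold are assumed primitives, not the NOW scheduler interfaces.

```c
/* Spin-then-block waiting: spin for about the expected response time (a fast
 * reply means the partner is scheduled), then block so the local scheduler
 * can run other work. The three primitives below are assumptions. */
#include <stdbool.h>

extern bool   reply_arrived(void);       /* assumed: poll for the partner's reply */
extern void   block_until_reply(void);   /* assumed: sleep until a message arrives */
extern double usec_now(void);            /* assumed: microsecond clock */

void wait_for_reply(double spin_threshold_us)
{
    double start = usec_now();
    while (!reply_arrived()) {
        if (usec_now() - start > spin_threshold_us) {
            /* Delayed response => partner probably not scheduled: block. */
            block_until_reply();
            return;
        }
        /* Fast response => partner scheduled: keep spinning so the two
         * processes remain coscheduled. */
    }
}
```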
Example
• Range of granularity and load imbalance
– spin-waiting alone suffers up to a 10x slowdown
I/O Lessons from NOW sort
• A complete system on every node is a powerful basis for data-intensive computing
– complete disk subsystem
– independent file systems
» use MMAP and MADVISE rather than read
– full OS => threads
• Remote I/O (with fast comm.) provides the same bandwidth as local I/O.
• I/O performance is very temperamental
– variations in disk speeds
– variations within a disk
– variations in processing, interrupts, messaging, ...
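A short sketch of the MMAP/MADVISE access pattern noted above, using standard POSIX calls; the function name and sequential-scan policy are illustrative choices, not NOW-Sort code.

```c
/* MMAP + MADVISE access to a sort run, instead of read(). */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

void *map_run_for_scan(const char *path, size_t *out_len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }

    void *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          /* the mapping survives the close */
    if (base == MAP_FAILED) return NULL;

    /* Tell the VM system we will scan sequentially, so it can prefetch
     * ahead and discard pages behind the scan point. */
    madvise(base, st.st_size, MADV_SEQUENTIAL);

    *out_len = (size_t)st.st_size;
    return base;
}
```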
Reactive I/O
• Loosen data semantics
– e.g., an unordered bag of records
• Build flows from producers (e.g., disks) to consumers (e.g., summation)
• Flow data to where it can be consumed (see the sketch below)
[Diagram: static parallel aggregation vs. adaptive parallel aggregation, with a distributed queue between disk producers (D) and aggregators (A).]
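The sketch below shows the adaptive side of the diagram in miniature: consumers pull batches from a shared distributed queue, so faster consumers naturally take more of the data. The dq_* interface and aggregate() are assumptions for illustration.

```c
/* Consumers pull unordered batches of records from a shared distributed
 * queue, so a fast consumer naturally takes more data than a slow one. */
#include <stddef.h>

typedef struct distq distq_t;            /* distributed queue of record batches */

extern int  dq_pull(distq_t *q, void *buf, size_t max_bytes);  /* assumed: 0 when drained */
extern void aggregate(const void *records, int nbytes);        /* assumed consumer */

void consumer_loop(distq_t *q)
{
    char batch[64 * 1024];
    int n;
    /* Each consumer pulls as fast as it can; the queue drains toward the
     * currently fastest nodes instead of following a fixed static split. */
    while ((n = dq_pull(q, batch, sizeof batch)) > 0)
        aggregate(batch, n);
}
```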
Performance Scaling
[Charts: % of peak I/O rate vs. number of nodes (0-15) and vs. number of nodes perturbed (0-15), comparing adaptive and static aggregation.]
• Adaptation allows more data to go to the faster consumers
Service Based Applications
Transcend Transcoding Proxy
[Diagram: service requests arrive at a front-end; a manager maps service threads onto physical processors and caches, consulting a user-profile database.]
• Application provides services to clients
• Grows/Shrinks according to demand, availability,
and faults
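A toy sketch of the grow/shrink behavior: a manager periodically resizes its pool of service threads to match observed demand, capped by the nodes currently available. The proportional policy and helper names are assumptions, not the Transcend code.

```c
/* Resize the worker pool to track demand and availability (illustrative). */
extern int  current_request_rate(void);   /* assumed: observed requests/sec */
extern int  available_nodes(void);        /* assumed: healthy nodes right now */
extern void start_worker(void);           /* assumed */
extern void stop_worker(void);            /* assumed */

void manager_tick(int *nworkers, int reqs_per_worker)
{
    int want = current_request_rate() / reqs_per_worker + 1;
    int cap  = available_nodes();          /* shrink under faults or contention */
    if (want > cap) want = cap;

    while (*nworkers < want) { start_worker(); (*nworkers)++; }
    while (*nworkers > want) { stop_worker();  (*nworkers)--; }
}
```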
On the other hand
• Glunix
– offered much that was not available elsewhere
» interactive use, load balancing, transparency (partial), …
– straightforward master-slave architecture
– millions of jobs served, reasonable scalability, flexible
partitioning
– crash-prone, inscrutable, unaware, …
• xFS
– very sophisticated co-operative caching + network RAID
– integrated at vnode layer
– never robust enough for real use
Both are hard, outstanding problems
Lessons
• Strength of clusters comes from
– complete, independent components
– incremental scalability (up and down)
– nodal isolation
• Performance heterogeneity and change are
fundamental
• Subsystems and applications need to be reactive
and self-tuning
• Local intelligence + simple, flexible composition
Millennium
[Diagram: campus-wide Gigabit Ethernet connecting departmental clusters: Business, SIMS, BMRC, C.S., Chemistry, E.E., Biology, Astro, NERSC, M.E., Physics, N.E., IEOR, C.E., MSME, Transport, Economy, Math.]
• Campus-wide cluster of clusters
• PC based (Solaris/x86 and NT)
• Distributed ownership and control
• Computational science and internet systems testbed
Paranoid Construction
• What must work for RSH, DCOM, RMI, read, …?
• A page of C to safely read a line from a socket! (see the sketch below)
=> carefully controlled set of cluster system op's
=> non-blocking with timeout and full error checking
– even if that requires a watcher thread
=> optimistic, with fail-over of implementation
=> global capability at the physical level
=> indirection used for transparency must track the fault envelope, not just provide a mapping
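A condensed, hedged version of that "page of C": read one line from a socket with a timeout and full error checking, so a wedged peer cannot block the caller forever. This is an illustration of the discipline, not the original code.

```c
/* Read one line from a socket with a timeout and full error checking. */
#include <poll.h>
#include <unistd.h>
#include <errno.h>

/* Returns line length, 0 on clean EOF, -1 on error or timeout. */
int read_line_timeout(int fd, char *buf, int max, int timeout_ms)
{
    int n = 0;
    while (n < max - 1) {
        struct pollfd p = { .fd = fd, .events = POLLIN };
        int r = poll(&p, 1, timeout_ms);
        if (r == 0) return -1;                     /* timed out: caller fails over */
        if (r < 0) { if (errno == EINTR) continue; return -1; }

        char c;
        ssize_t g = read(fd, &c, 1);
        if (g == 0) break;                         /* peer closed the connection */
        if (g < 0) { if (errno == EINTR) continue; return -1; }

        buf[n++] = c;
        if (c == '\n') break;
    }
    buf[n] = '\0';
    return n;
}
```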
Computational Economy Approach
• System has a supply of various resources
• Demand for resources revealed in price
– distinct from the cost of acquiring the resources
• User has unique assessment of value
• Client agent negotiates for system resources on
user’s behalf
– submits requests, receives bids or participates in auctions
– selects resources of highest value at least cost
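A minimal sketch of the client agent's selection step: take the bid whose value to this user most exceeds its quoted price. The bid fields and value_of() are assumptions for illustration.

```c
/* Pick the bid with the largest (value - price) surplus for this user. */
typedef struct {
    int    resource_id;
    double price;        /* price revealed by current demand */
    double speed;        /* e.g., expected work delivered per hour */
} bid_t;

extern double value_of(double speed);   /* assumed: this user's private valuation */

int choose_bid(const bid_t *bids, int nbids)
{
    int best = -1;
    double best_surplus = 0.0;
    for (int i = 0; i < nbids; i++) {
        double surplus = value_of(bids[i].speed) - bids[i].price;
        if (surplus > best_surplus) { best_surplus = surplus; best = i; }
    }
    return best;        /* -1: no bid is worth its price to this user */
}
```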
Advantages of the Approach
• Decentralized load balancing
– according to user’s perception of importance, not system’s
– adapts to system and workload changes
• Creates Incentive to adopt efficient modes of use
– maintain resources in usable form
– avoid excessive usage when needed by others
– exploit under-utilized resources
– maximize flexibility (e.g., migratable, restartable applications)
• Establishes user-to-user feedback on resource usage
– basis for exchange rate across resources
• Powerful framework for system design
– Natural for client to be watchful, proactive, and wary
– Generalizes from resources to services
• Rich body of theory ready for application
Resource Allocation
[Diagram: an allocator matching a stream of (incomplete) client requests against a stream of (partial, delayed, or incomplete) resource status information.]
• Traditional approach allocates requests to
resources to optimize some system utility function
– e.g., put work on least loaded, most free mem, short queue, ...
• Economic approach views each user as having a
distinct utility function
– e.g., two users can exchange resources and both end up happier! (see the toy example below)
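A toy numerical version of that point: with distinct utility functions, swapping an allocation chosen for system utility can leave both users strictly better off. The numbers are invented purely for illustration.

```c
/* Two users, two resources, invented utilities: a swap helps both. */
#include <stdio.h>

int main(void)
{
    /* utility[user][resource]: user 0 mostly wants the big-memory node,
     * user 1 mostly wants the fast-CPU node. */
    double utility[2][2] = { { 10.0, 4.0 },    /* user 0: mem node, cpu node */
                             {  3.0, 8.0 } };  /* user 1: mem node, cpu node */

    /* Suppose the allocator gave user 0 the cpu node and user 1 the mem node. */
    printf("before swap: u0 = %.0f, u1 = %.0f\n", utility[0][1], utility[1][0]);
    /* After exchanging with each other, both utilities rise. */
    printf("after  swap: u0 = %.0f, u1 = %.0f\n", utility[0][0], utility[1][1]);
    return 0;
}
```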
Pricing and all that
• What’s the value of a CPU-minute, a MB-sec, a
GB-day?
• Many iterative market schemes
– raise price till load drops
• Auctions avoid setting a price
– Vickrey (second-price sealed-bid) auctions will cause resources to go to where they are most valued, at the lowest price
– it is in bidders’ self-interest to reveal their true utility function!
• Small problem: auctions are awkward for most
real allocation problems
• Big problem: people (and their surrogates) don’t
know what value to place on computation and
storage!
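A small sketch of Vickrey clearing: the highest bid wins but pays the second-highest price, which is what makes truthful bidding the dominant strategy. The structures are illustrative.

```c
/* Second-price sealed-bid (Vickrey) winner determination. */
typedef struct { int bidder; double amount; } sealed_bid_t;

/* Returns the winner's index; *price_out is what the winner pays. */
int vickrey_clear(const sealed_bid_t *bids, int n, double *price_out)
{
    if (n == 0) return -1;
    int best = 0, second = -1;
    for (int i = 1; i < n; i++) {
        if (bids[i].amount > bids[best].amount)      { second = best; best = i; }
        else if (second < 0 || bids[i].amount > bids[second].amount) { second = i; }
    }
    /* With no competing bid, fall back to a reserve price (zero here). */
    *price_out = (second >= 0) ? bids[second].amount : 0.0;
    return best;
}
```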
Smart Clients
• Adopt the NT view that “everything is two-tier, at least”
– UI stays on the desktop and interacts with computation “in
the cluster of clusters” via distributed objects
– Single-system image provided by wrapper
• Client can provide complete functionality
– resource discovery, load balancing
– request remote execution service
• Flexible applications will monitor availability and adapt.
• Higher-level services offer a 3-tier optimization (see the sketch below)
– directory service, membership, parallel startup
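A sketch of a smart client doing its own discovery and load balancing, with fall-back to self support when the higher-level directory service is unavailable; all service names here are assumptions for illustration.

```c
/* Client-side discovery and load balancing with fall-back. */
typedef struct { const char *host; double load; } node_info_t;

extern int directory_lookup(node_info_t *nodes, int max);   /* assumed 3-tier service */
extern int static_node_list(node_info_t *nodes, int max);   /* assumed local fallback */

const char *pick_execution_node(void)
{
    node_info_t nodes[64];
    int n = directory_lookup(nodes, 64);
    if (n <= 0)
        n = static_node_list(nodes, 64);   /* service failed: self support */
    if (n <= 0)
        return NULL;

    int best = 0;
    for (int i = 1; i < n; i++)
        if (nodes[i].load < nodes[best].load) best = i;
    return nodes[best].host;
}
```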
Everything is a service
• Load-balancing
• Brokering
• Replication
• Directories
=> they need to be cost-effective or clients will fall back to “self support”
– if they are cost-effective, competitors might arise
• Useful applications should be packaged as
services
– their value may be greater than the cost of resources
consumed
Conclusions
• We’ve got the building blocks for very interesting
clustered systems
– fast communication, authentication, directories, distributed
object models
• Transparency and uniform access are
convenient, but...
• It is time to focus on exploiting the new
characteristics of these systems in novel ways.
• We need to get real serious about availability.
• Agility (wary, reactive, adaptive) is fundamental.
• Gronky “F77 + MPI and no IO” codes will
seriously hold us back
• Need to provide a better framework for cluster
applications