Building Scalable, High Performance Cluster and Grid Networks: The Role of Ethernet
Thriveni Movva
CMPS 5433
Overview
 About Grids/Clusters
 Uses of Grid
 Differences between Grids/Clusters
 Benefits of Grid
 Grid Architecture
 Building Ethernet Network for Grids/Clusters
 Examples of Ethernet Grids/Clusters
 Conclusion/Summary
What Is A Grid Computer?
 Hardware and Software System
 Integrates a collection of distributed system components
 Computer systems
 Storage, etc.
 Solves large-scale computation problems
 Appears to the user as a single, large “virtualized” computing system
 Consists of geographically dispersed computers
What is a Cluster?
 Multiprocessor system consisting of co-located computers and storage
 Viewed as though it were a single computer
 Connected through fast local area networks (localized within a room or building)
 Provides more speed and/or reliability than a single computer
 More cost-effective than single computers of comparable speed or reliability
Uses of Grid Computing
 Computer systems and other resources need not be dedicated to individual users or applications
 They can be made available for dynamic pooling and sharing as needs change
 Over the Internet, Grid-based resource sharing and collaborative problem solving can be extended to multi-institutional “Virtual Organizations”
Differences between Grids/Clusters
 Grids:
• dispersed over a local, metropolitan or wide area network
• span administrative boundaries
• focus on problems in distributed computing and resource sharing
• distribute workloads among different machine types and operating systems
 Clusters:
• localized within a room or building
• under a single administration
• focus on compute-intensive problems and HPC
• homogeneous (single type of processor and OS)
Benefits Of The Grid
 Grid computing offers a number of potential uses and benefits, which can be broadly grouped into three categories:
 High Performance Computing (HPC)
 Data Federation and Collaboration
 Resource Allocation and Optimization
High Performance Computing (HPC)
 Computationally intensive, parallelizable applications can benefit
 Uses an array of numerous commodity or specialized systems
 Most applications of the Grid fall into the HPC classification
 Advantages of HPC:
• Cost-effective solutions to critical problems
• High return on investment
• Solves problems that could not previously be solved within the given time and cost
• Solves problems too large for conventional supercomputers
 Fields in which the HPC Grid has successfully addressed a wide range of computational problems include:
• Climate/weather/ocean modeling and simulation
• Internet search engines
• Signal/image processing
• Pharmaceutical research
• Military forces simulation
Data Federation and Collaboration
 Consolidates data from different sources in a single data service
 Hides data location, local ownership and infrastructure from the application
 Causes no disruption to local users, applications or data management policies
 Facilitates a wide range of integrated applications, such as:
• Corporate performance dashboards
• Marketing analysis tools
• Customer service applications
• Data mining applications
Resource Allocation and Optimization
 Sharing of computing and storage to improve resource utilization
 For example, applications and batch jobs can be moved to an idle server
 Benefits of resource optimization
 Reclaims much of the stranded capacity of the computing infrastructure
 Reduces the level of capital investment
 No modification of existing applications is required
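The idea of moving work to an idle server can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Server` class, the slot-based `capacity` measure and the `dispatch` function are hypothetical names invented for this sketch, not part of any Grid product mentioned in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """A pooled server (hypothetical model; capacity is counted in job slots)."""
    name: str
    capacity: int
    jobs: list = field(default_factory=list)

    @property
    def utilization(self) -> float:
        # Fraction of job slots currently in use.
        return len(self.jobs) / self.capacity


def dispatch(job: str, servers: list) -> Server:
    """Place the job on the least-utilized server that still has a free slot."""
    candidates = [s for s in servers if len(s.jobs) < s.capacity]
    if not candidates:
        raise RuntimeError("no free capacity in the pool")
    target = min(candidates, key=lambda s: s.utilization)
    target.jobs.append(job)
    return target
```

A real scheduler would also weigh CPU type, memory and data locality; the sketch only shows how otherwise stranded capacity gets used first.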
Grid Computing Architecture
 The basic architecture of a Grid consists of:
• User Interface
• Applications
• Grid Middleware
• Computing Resources
• Grid Network
Applications
 Classification of parallel applications
 Embarrassingly Parallel Computations (EPC)
• The problem is divided into independent parts
• The parts are allocated to multiple processors for simultaneous execution
• No communication is required between the processors
• Example: testing large integers to determine whether they are prime
 Parametric and Data Parallel Computations
• Also referred to as Nearly Embarrassingly Parallel Computations (NEPC)
• Each processor works on an independent subset of the data
• Data is later gathered by a single process
• Example: Internet search engines
 Loosely Coupled Synchronous Parallel Computations
• Require inter-process communication among a small subset of processors before the computation can be completed
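The prime-testing example of an embarrassingly parallel computation can be sketched as follows. This is a minimal illustration, not a Grid implementation: `is_prime` and `find_primes` are names chosen for this sketch, and the workers come from Python's standard `concurrent.futures` executors rather than from Grid middleware.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def is_prime(n: int) -> bool:
    """Trial division; each call is independent of every other call."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def find_primes(candidates, executor: Executor) -> list:
    """Farm the independent tests out to the executor's workers.

    The workers never communicate with each other; the only shared step
    is collecting the per-candidate results at the end.
    """
    candidates = list(candidates)
    flags = executor.map(is_prime, candidates)  # results come back in input order
    return [n for n, flag in zip(candidates, flags) if flag]
```

Any `Executor` can be passed in. A `ThreadPoolExecutor` is enough to show the structure; for real CPU-bound work in CPython a `ProcessPoolExecutor` (created under an `if __name__ == "__main__":` guard) would be the usual choice, and on a Grid the same role is played by the middleware's job scheduler.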
Grid Middleware
 Gives the Grid the semblance of a single computer system
 Provides coordination among the computing resources of the Grid
 Provides location transparency
 Allows applications to run over a virtualized layer of networked resources
 Available from system vendors and independent software vendors
 Example: Globus Toolkit
Functions of Middleware
 Discovery and monitoring
 Discover what resources or services are available
 Monitor their status
 Resource allocation and management
 Matches application requirements to the available computing resources
 Creates and schedules remote jobs as required
 Ensures optimum load balancing and resource utilization
 Security
 Shared resources may contain sensitive information
 Secures communications and authenticates user identities, for example using SSL/TLS
 Message Passing System
 Used by compute-intensive parallel applications for inter-process communication
 Examples: MPI (Message Passing Interface) and PVM (Parallel Virtual Machine)
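The discovery and resource-allocation functions listed above can be sketched as a small matchmaking routine. Everything here is a hypothetical illustration: `Resource`, `JobRequest`, `discover` and `match` are invented for this sketch and do not correspond to the API of the Globus Toolkit or any other middleware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One advertised computing resource (hypothetical registry entry)."""
    host: str
    cpus: int
    memory_gb: int
    online: bool = True


@dataclass(frozen=True)
class JobRequest:
    """What an application asks the middleware for."""
    name: str
    cpus: int
    memory_gb: int


def discover(registry: list) -> list:
    """Discovery and monitoring: report only resources that are currently up."""
    return [r for r in registry if r.online]


def match(job: JobRequest, registry: list):
    """Resource allocation: pick the smallest online resource that fits the job."""
    fitting = [
        r for r in discover(registry)
        if r.cpus >= job.cpus and r.memory_gb >= job.memory_gb
    ]
    if not fitting:
        return None
    # Best fit keeps the larger machines free for larger jobs.
    return min(fitting, key=lambda r: (r.cpus, r.memory_gb))
```

Best fit is just one possible policy, assumed here for simplicity; production middleware typically also considers queue length, priorities and security constraints.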
Ethernet Networks for Clusters and Grids
 Single-switch Clusters
 Large Clusters
 Ethernet Grid Networks
Single-switch Clusters
 Built using a single high-availability Gigabit Ethernet (GbE) switch/router as the cluster interconnect
 The maximum size of a single-switch Ethernet cluster is determined by the non-blocking port capacity of the switch
 Current switch/routers can interconnect more than 600 GbE-connected servers
 All server ports are configured to be in the same subnet
Large Clusters
 Built using meshes of federated Ethernet switches
 The Ethernet switches are arranged in non-blocking, Constant Bisectional Bandwidth (CBB) topologies
 CBB:
• Provides scalability to support thousands of cluster nodes
• Provides high-bandwidth connectivity to the network
 The core of the cluster provides each node switch with an equal load share to avoid blocking of ports
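How a non-blocking topology scales to thousands of nodes can be worked through with a short calculation. The sketch below assumes a generic two-tier design in which every link runs at the same speed, each node switch splits its ports evenly between servers and uplinks, and each node switch has exactly one uplink to every core switch. These assumptions and the function name `two_tier_cbb` are mine; the slides do not specify the exact topology.

```python
def two_tier_cbb(node_switch_ports: int, core_switch_ports: int) -> dict:
    """Size a two-tier non-blocking network (assumed layout, see lead-in).

    Half of each node switch's ports face servers and half face the core,
    so uplink bandwidth equals server-facing bandwidth (1:1).
    """
    if node_switch_ports < 2 or node_switch_ports % 2:
        raise ValueError("node switch needs an even number of ports")
    if core_switch_ports < 1:
        raise ValueError("core switch needs at least one port")
    down = node_switch_ports // 2          # server-facing ports per node switch
    up = node_switch_ports - down          # uplinks per node switch
    return {
        "core_switches": up,               # one uplink to each core switch
        "node_switches": core_switch_ports,  # each core port serves one node switch
        "max_nodes": core_switch_ports * down,
        "oversubscription": down / up,     # 1.0 means non-blocking
    }
```

Under these assumptions, 48-port switches in both tiers give 24 core switches, 48 node switches and 48 × 24 = 1152 server ports, which is consistent with the "thousands of cluster nodes" scale once larger core switches are used.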
Ethernet Grid Networks
(Campus Grid network based on Ethernet switching)
 Ethernet allows the cluster to participate in a broader campus or enterprise Grid structure
 Desktop computers and workstations are connected to the campus Grid network using GbE
 Server farms outside the cluster are connected to site switches using GbE
 Goals of the campus LAN:
• Give high priority to general Grid traffic
• Ensure that critical Grid traffic does not incur any added latency
Grid Tools
 Tools used to prioritize critical Grid traffic
 Priority Queuing
• The forwarding capacity of a congested port is immediately allocated to any high priority traffic that enters the queue
 Rate limiting and policing
• Limits the amount of lower priority traffic that enters the network
 Weighted Random Early Discard (WRED)
• Packet loss can be eliminated if buffers are never allowed to fill to capacity and overflow
• Overflows can be avoided by applying WRED to the lower priority traffic
• WRED eliminates the possibility of high priority packets arriving at a buffer that is already overflowing with lower priority packets
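The interplay of priority queuing and WRED can be illustrated with a small simulation. This is a simplified sketch, not a switch implementation: the drop curve is the commonly described linear ramp between a minimum and a maximum threshold, the instantaneous queue depth stands in for the averaged depth that real devices track, and the names `wred_drop_probability`, `StrictPriorityPort` and `admit_low_priority` are invented here.

```python
import heapq
import itertools
import random


def wred_drop_probability(avg_queue, min_th, max_th, max_p):
    """Linear early-drop ramp: 0 below min_th, max_p just under max_th, 1 at or above it."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


class StrictPriorityPort:
    """Output port that always forwards the highest priority packet first."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # keeps FIFO order within one priority

    def enqueue(self, packet, priority):
        # Lower number means higher priority (0 = critical Grid traffic).
        heapq.heappush(self._heap, (priority, next(self._seq), packet))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def admit_low_priority(port, packet, profile, rng=random.random):
    """Apply WRED to lower priority traffic only; profile = (min_th, max_th, max_p)."""
    # Simplification: uses the current depth instead of a moving average.
    if rng() < wred_drop_probability(len(port), *profile):
        return False  # dropped early, before the buffer can overflow
    port.enqueue(packet, priority=1)
    return True
```

Because lower priority packets start being discarded well before the buffer is full, room remains for high priority packets, and the strict priority dequeue then forwards them ahead of anything else in the queue.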
Examples of Ethernet Cluster/Grids
 TeraGrid
 A multi-institutional effort to build and deploy the world’s most comprehensive computing infrastructure for open scientific research
 NASA
 NASA uses the ESDCD “Grid of clusters” to help scientists increase their understanding of the Earth, the solar system and the universe through computational modeling and processing of space-borne observations
Conclusion/Summary
 Ethernet continues to evolve as a highly cost-effective and flexible technology
 The majority of parallel and general Grid applications are very well served by the performance characteristics of Ethernet as the cluster/Grid interconnect
 In the future, Ethernet end-to-end data transfer bandwidths, message latencies and CPU utilization are expected to improve dramatically due to NIC enhancements
 Volume production is leading to price declines
 These developments are expected to improve the overall performance of existing Ethernet clusters/Grids and to broaden the use of cluster/Grid technology to a wider range of commercial enterprises