Texas Advanced Computing Center

UT Grid: Building a campus grid
Ashok Adiga, Ph.D.
Distributed & Grid Computing Group
Texas Advanced Computing Center
The University of Texas at Austin
[email protected]
(512) 471-8196
TACC Grid Program
• TACC involved in several Grid projects
– Campus Grid (UT Grid, partially funded by IBM)
– State Grid (TIGRE)
– National Grid (ETF)
• Grid Hardware Resources
– Wide range of hardware resources available to research community
at UT and partners
• Grid Software Resources
– Significantly leverage NMI GRIDS components (Globus Toolkit,
GPT, MyProxy, GridPort, GridFTP, …)
– Other software where necessary
• Resource managers (Condor, LSF, PBS, United Devices)
• Schedulers (Condor, Community Scheduling Framework)
TeraGrid (National)
• NSF Extensible Terascale Facility (ETF) project
– build and deploy the world's largest, fastest, distributed computational
infrastructure for general scientific research
– 40 Gbps backbone with hubs in Los Angeles, Chicago & Atlanta
• UT (led by TACC) going online on TeraGrid on October 1, 2004
– 10 Gbps network connection to ETF backbone
– Provide access to high-end computers capable of 6.2 teraflops, a new
terascale visualization system, and a 2.8-petabyte mass storage system
– Provide access to geoscience data collections used in environmental,
geological, climate, and biological research:
• high-resolution digital terrain data
• worldwide hydrological data
• global gravity data
• high-resolution X-ray computed tomography data
• Current software stack includes: Globus (GSI, GRAM, GridFTP),
MPICH-G2, Condor-G, GPT, MyProxy, SRB
TIGRE (State-wide Grid)
• Texas Internet Grid for Research and
Education
– computational grid to integrate computing &
storage systems, databases, visualization
laboratories and displays, and instruments and
sensors across Texas.
– Funding announced by Gov. Rick Perry at
Internet2
– TIGRE members include several leading state
institutions:
• Rice, Texas A&M, Texas Tech, U of Houston, UT Austin,
UT El Paso, others…
– Initial software stack will use NMI GRIDS
UT Grid Vision: A Powerful, Flexible, and
Simple Virtual Environment for Research
& Education
The UT Grid vision is the creation of a
cyberinfrastructure for research and
education in which people can develop and
test ideas, collaborate, teach, and learn
through applications that seamlessly harness
the diverse campus compute, visualization,
storage, data, and instrument resources as needed
from their personal systems (PCs) and
interfaces (web browsers, GUIs, etc.).
UT Grid: Develop and Provide a
Unique, Comprehensive
Cyberinfrastructure…
The strategy of the UT Grid project is to integrate…
– common security/authentication
– scheduling and provisioning
– aggregation and coordination
diverse campus resources…
– computational (PCs, servers, clusters)
– storage (local HDs, NAS, SANs, archives)
– visualization (PCs, workstations, displays, projection rooms)
– data collections (sci/eng, social sciences, communications, etc.)
– instruments & sensors (CT scanners, telescopes, etc.)
from ‘personal scale’ to terascale…
– personal laptops and desktops
– department servers and labs
– institutional (and national) high-end facilities
…That Provides Maximum Opportunity
& Capability for Impact in Research,
Education
…into a campus cyberinfrastructure…
– evaluate existing grid computing technologies
– develop new grid technologies
– deploy and support appropriate technologies for production use
– continue evaluation, R&D on new technologies
– share expertise, experiences, software & techniques
that provides simple access to all resources…
– through web portals
– from personal desktop/laptop PCs, via custom CLIs and GUIs
to the entire community for maximum impact on
– computational research in applications domains
– educational programs
– grid computing R&D
UT Grid Approach: Leverage
Strengths of Campus Environment
• Like any grid, a campus grid must provide services to
simplify use of distributed resources
• But
– Focus must be to support research and/or
education mission of the university
– Campus grid can leverage vast numbers of PCs
and large numbers of clusters
– Campus grid can integrate novel scientific data
collections and research instruments
UT Grid Approach: Leverage
Strengths of Campus Environment
• Important differences from multi-institution grids:
– Staff in one location, can collaborate face-to-face
– ‘Controlled’ network environment
– High-end computing center can lead deployment
• Important differences from enterprise grids:
– Researchers generally more independent than in company
– No central IT group governs researchers’ systems
– Usage models driven by different priorities
• Important differences from domain-specific grids:
– Might require integration of wider variety of resources
– Must support wider variety of usage models
UT Has Massive Scale and Unique
Deployment Environment
• ACES building is a model for a university grid
– Massive bandwidth
– Multidisciplinary users
– Numerous PCs, clusters, visualization systems,
and storage resources
• UT main campus + UT research campus can
be a model for a multi-institution grid
– Separated by true WAN, but UT controls paths
– Massive bandwidth (10GigE) between campuses
– TACC controls resources on both campuses
UT Grid Project Team Has
Participation From Several Campus
Departments…
• Additional UT Partners
– Information Technology Services (ITS):
• deploying Roundup clients, will include client s/w in BevoWare
– College of Engineering IT Group:
• deploying Roundup clients
– Center for Instructional Technology (CIT):
• helped with Web site, will create education content
– Department of Computer Sciences
• integrating Condor flock, partnering in R&D proposals
– Institute for Computational Engineering & Sciences (ICES):
• integrating clusters and Condor flock
…and Participation Will Grow
Significantly as We Enter Production
• Additional Partners Expected in the next 6 months
– Mary Wheeler, ICES
• integrating cluster, leading-edge user
– Kamy Sepehrnoori, Dept of Petroleum & Geophysical Eng.
• integrating cluster, leading-edge user
– College of Fine Arts
• providing Roundup clients
– College of Communications
• interested in storage services
– Additional outreach through UT ‘Tech Deans’ Committee
– Additional users through TACC User Services
UT Grid Components
• Grid User Interfaces
– Typical grid interface is via user portals
– Grid User Nodes provide users with command line (shell)
interfaces to the grid
• Grid Resources
– Compute, storage, visualization, instruments
– Grid software must provide security, monitoring, remote
access
• Grid Services
– Authentication (GSI, MyProxy; see the proxy sketch below)
– Scheduling (Condor, CSF)
– Data management (SRB, Avaki)
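To make the authentication service above concrete, here is a minimal sketch of the GSI proxy step behind single sign-on: check for a valid proxy credential and create one if needed. It assumes the Globus Toolkit command-line tools (grid-proxy-init, grid-proxy-info) are installed and the user already holds a grid certificate; this is an illustration, not UT Grid's actual portal code.

```python
import subprocess

def ensure_proxy(min_seconds: int = 3600) -> None:
    """Create a GSI proxy credential if none is valid long enough."""
    # grid-proxy-info -timeleft prints the remaining proxy lifetime in
    # seconds and exits non-zero when no valid proxy exists.
    try:
        out = subprocess.run(
            ["grid-proxy-info", "-timeleft"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if int(out) >= min_seconds:
            return  # existing proxy is still good
    except (subprocess.CalledProcessError, ValueError):
        pass  # no usable proxy: create one below

    # Prompts for the certificate passphrase and writes a short-lived
    # proxy that GRAM, GridFTP, and other GSI-enabled services then use.
    subprocess.run(["grid-proxy-init"], check=True)

if __name__ == "__main__":
    ensure_proxy()
```

Services such as MyProxy extend this by storing the proxy on a server so a portal can retrieve it on the user's behalf.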
UT Grid: Current Status
• Providing compute services to users today
– Heterogeneous set of cluster resources (LSF, PBS, LoadLeveler,
Condor) and desktop resources (United Devices, Condor)
– Single sign-on access via user portal
– Allocation and support procedures
– Resource monitoring
– Serial and parallel job submission to clusters and desktop
resources
– Evaluation of scheduling technologies (Condor, CSF)
– Evaluating workflow solutions (Pegasus)
• Basic data services
– Reliable File Transfer tool built using GridFTP, NWS, GPIR (see the GridFTP sketch below)
– Share data across resources using Avaki data grid
– SRB
• Visualization services coming soon
– Remote interactive visualization, batch rendering,
computational steering
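As a companion to the data services above, here is an illustrative sketch of the GridFTP transfer underneath a reliable file transfer tool: a simple retry loop around globus-url-copy (a standard Globus command). The endpoints are hypothetical, and the real tool's NWS/GPIR-informed monitoring and transfer estimates are omitted.

```python
import subprocess

def gridftp_copy(src_url: str, dst_url: str, retries: int = 3) -> None:
    """Copy a file between GridFTP/local URLs, retrying on failure."""
    # A valid GSI proxy (grid-proxy-init) must already exist.
    for attempt in range(1, retries + 1):
        result = subprocess.run(["globus-url-copy", src_url, dst_url])
        if result.returncode == 0:
            return
        print(f"transfer attempt {attempt} failed; retrying")
    raise RuntimeError(f"could not copy {src_url} to {dst_url}")

if __name__ == "__main__":
    # Hypothetical endpoints for illustration only.
    gridftp_copy(
        "gsiftp://cluster.example.utexas.edu/scratch/user/results.dat",
        "file:///home/user/results.dat",
    )
```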
Challenges Include Scale,
Heterogeneity, Purpose, and Policies
• Usage models:
– research vs. education (vs. administrative)
– ISV apps vs. custom apps
– interactive vs. batch
– serial codes vs. parallel codes
– etc.
• Most resources are locally managed
– Local policies and procedures
– Different priorities
– Sense of ownership
– Varying expertise levels of administrators
– Varying levels of support
UT Grid: Approach to building the grid
• Challenge: Getting scientists to use UT Grid
– Gain confidence that they can meet their
computing goals and benefit from using the grid
– Share their resources by making them available to
other grid users
• Hub-and-spoke approach rather than peer-to-peer
resource sharing
– Leverage existing trust relationships between
TACC and campus research users
– As users become comfortable with grid software,
convince them to share their resources
UT Grid: Logical View
• Integrate each set of resources (compute, vis,
storage, data) within TACC first
– TACC compute, vis, storage, and data resources
(actually spread across two campuses) form the hub
• Next add other UT resources as spokes, using the
same tools and procedures
– ACES cluster, data, and PCs; GEO clusters and data;
PGE cluster, data, and instrument; BIO data and instrument
• Finally negotiate connections between spokes for
willing participants to develop a P2P grid
[Diagram series: hub-and-spoke topology growing outward
from the TACC hub to the ACES, GEO, PGE, and BIO spokes,
then adding spoke-to-spoke links]
Accessing UT Grid: Portals vs CLIs
• The preference for portals over command-line
interfaces is not universal
– Some researchers prefer to use their current shell
interface to access the grid
• UT Grid supports Grid User Portals (GUPs)
and Grid User Nodes (GUNs)
Why Are GUPs Important?
• Lower the barrier of entry into grid computing
– Easy access to multiple resources through a
single interface
– Simple graphical interface to complex grid computing
capabilities
– Present a “Virtual Organization” view of the Grid
as a whole
UT GUP Infrastructure
• Portal based on
– Grid Portal Toolkit 3 (NMI component)
– Jetspeed Portal infrastructure
• Underlying Grid Middleware
– Globus
– Community Scheduling Framework
– Network Weather Service
– Soon: Avaki, SRB
UT GUP Capabilities
• Initial GUP capabilities include:
– View information on resources within UT Grid,
including status, load, jobs, queues, etc.
– View network bandwidth and latency between
systems, aggregate capabilities for all systems.
– Submit user jobs and run hosted applications
– Manage files across systems, and move/copy
multiple files between resources with transfer time
estimates (see the estimation sketch below)
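The transfer time estimates mentioned above can be derived from NWS-style bandwidth and latency measurements. The sketch below shows one plausible formula (startup latency plus file size over measured bandwidth); the formula and sample numbers are assumptions for illustration, not the portal's actual code.

```python
def estimate_transfer_seconds(size_bytes: int,
                              bandwidth_mbps: float,
                              latency_ms: float) -> float:
    """Rough transfer-time estimate from measured path characteristics."""
    # Convert megabits/s to bytes/s, then add one round trip of startup cost.
    bytes_per_second = bandwidth_mbps * 1_000_000 / 8
    return latency_ms / 1000.0 + size_bytes / bytes_per_second

if __name__ == "__main__":
    # e.g. a 200 MB file over a measured 90 Mbps path with 5 ms latency
    secs = estimate_transfer_seconds(200 * 1024 * 1024, 90.0, 5.0)
    print(f"estimated transfer time: {secs:.1f} s")
```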
UT Grid User Portal
[Screenshot: job submission templates]
Community Scheduling Framework (CSF)
• Open-source metascheduler written by Platform
Computing
– Distributed under Globus Public License
– Developed using GT3.0.2 and OGSI
– Will be part of future Globus Toolkit distribution
• Schedules jobs across heterogeneous resources
– Advanced reservation support
– Architecture allows pluggable scheduling policies
– Resource Manager Adapters required to convert requests to
local resource manager commands (see the adapter sketch below)
– Dynamic performance information stored in Global
Information Service
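To illustrate the adapter role noted above: a Resource Manager Adapter translates a scheduler-neutral job request into the local resource manager's submit command. The sketch below is hypothetical Python (CSF's actual adapters are Java/OGSI services), and the PBS directive shown is illustrative.

```python
from dataclasses import dataclass
from typing import List
import subprocess

@dataclass
class JobRequest:
    """Scheduler-neutral job description handed down by the metascheduler."""
    executable: str
    arguments: List[str]
    node_count: int

class PBSAdapter:
    """Converts a JobRequest into a qsub submission."""
    def submit(self, job: JobRequest) -> str:
        script = (f"#PBS -l nodes={job.node_count}\n"
                  f"{job.executable} {' '.join(job.arguments)}\n")
        # qsub reads the job script from stdin and prints the job id.
        out = subprocess.run(["qsub"], input=script, text=True,
                             capture_output=True, check=True)
        return out.stdout.strip()

# An LSF adapter would do the same with bsub; the metascheduler picks the
# adapter matching the resource chosen by its scheduling plug-in.
```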
UT Grid CSF Configuration
[Diagram: the user portal (GridPort on a web server) submits
jobs via GT3.0 to the CSF server, whose Queuing, Job, and
Reservation services dispatch work through GT3.0 RM Adapters
for PBS and LSF to the local PBS and LSF clusters; queues
implement customizable scheduling policies using plug-ins]
Why Are GUNs Important?
• Most campus users have PCs for their research &
education projects
– They are used to their local systems
• They also often need additional resources
– They may want more flexibility than a portal provides
– They need to be able to keep doing what they know, issuing
the same commands, but reaching additional resources
– They would like access to those resources easily, even
transparently
• The Grid User Node concept is designed to provide
these features and capabilities
Current Linux GUN Software
Users have the option of installing the software stack on
their desktops or using a “hosted” GUN.
• Red Hat Linux 9.0
• Globus 3.2.1 NMI Release 5
– Ant v1.6.2
– Java J2SE SDK v1.4
– Grid Package Tools v3.2.1 NMI 5
• GridShell (pre-release version)
• Condor
• MPICH
• United Devices SDK 4.1
– Perl v5.6
What is GridShell?
• GridShell is an extension of the TCSH and BASH
shells
– includes transparent distributed execution and
data transfer features for intra- and inter-cluster
execution of programs (see the sketch below)
– Currently supports LSF, Condor and Globus
environments
– Goal is to extend services to match portal services
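The sketch below illustrates the idea behind transparent distributed execution: a command typed at the shell is quietly repackaged as a batch job (Condor here) and its output returned as if it had run locally. This is a conceptual illustration only, not GridShell's implementation or syntax.

```python
import os
import subprocess
import tempfile

def run_on_grid(command: str, *args: str) -> str:
    """Run a command as a Condor job and return its stdout."""
    workdir = tempfile.mkdtemp()
    submit_file = os.path.join(workdir, "job.sub")
    with open(submit_file, "w") as f:
        f.write(
            "universe = vanilla\n"
            f"executable = {command}\n"
            f"arguments = {' '.join(args)}\n"
            f"output = {workdir}/job.out\n"
            f"error = {workdir}/job.err\n"
            f"log = {workdir}/job.log\n"
            "queue\n"
        )
    subprocess.run(["condor_submit", submit_file], check=True)
    # condor_wait blocks until the events in the job log show completion.
    subprocess.run(["condor_wait", f"{workdir}/job.log"], check=True)
    with open(f"{workdir}/job.out") as f:
        return f.read()

if __name__ == "__main__":
    print(run_on_grid("/bin/hostname"))
```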
Grid User Node
[Screenshot]
UT Grid: Application-driven design
• UT Grid design based on user requirements
– Initial user set has been identified
– Monthly meetings, mailing lists
– Interviews to understand use cases
• Initial set of application areas have compute,
storage and visualization requirements
– Computational Fluid Dynamics (Dr. Carey)
– Reservoir modeling (Dr. Wheeler)
– Flood prediction (Dr. Wells & Dr. Maidment)
UT Grid: Education
• Training courses offered 3-4 times/year
– GridPort (offered via Access Grid)
– Running applications using United Devices
– HPC training (MPI apps, tools)
• Courses offered through CS department
– High Performance Computing for scientists (this semester)
– Grid Computing in science and engineering (summer ‘05)
• CIT planning to provide educational content about UT
Grid research applications
NMI Experiences
• TACC has benefited from using NMI
– Easier to install & configure components
– Better documentation & support
– Software is more robust since it has gone through a level of
integration testing
– Exposure to new components (GridSolve)
– Working with other NMI Testbed members
• Although NMI components are fairly reliable
– They are still evolving, and occasionally cause backward
compatibility issues (e.g. between Globus versions 3.0.2,
3.2, 3.2.1, and 4.0)
• NMI not a complete grid solution
– Components do not address: scheduling, workflow,
accounting, …
UT Grid Project Team
• Jay Boisseau
• Maytal Dahan
• Edward Walker
• Ashok Adiga
• Ashesh Sahib
• CJ Barker
• Akhil Seth
• David Walling
• Eric Roberts
• Jeff Mausolf (IBM)
• Nina Wilner (IBM)
Texas Advanced Computing Center
www.tacc.utexas.edu
(512) 475-9411