
Deployment of NMI Components
on the UT Grid
Shyamal Mitra
TEXAS ADVANCED COMPUTING CENTER
Outline
• TACC Grid Program
• NMI Testbed Activities
• Synergistic Activities
– Computations on the Grid
– Grid Portals
TACC Grid Program
• Building Grids
– UT Campus Grid
– State Grid (TIGRE)
• Grid Resources
– NMI Components
– United Devices
– LSF Multicluster
• Significantly leveraging NMI Components and
experience
Resources at TACC
• IBM Power 4 System (224 processors, 512 GB Memory, 1.16 TF)
• IBM IA-64 Cluster (40 processors, 80 GB Memory, 128 GF)
• IBM IA-32 Cluster (64 processors, 32 GB Memory, 64 GF)
• Cray SV1 (16 processors, 16 GB Memory, 19.2 GF)
• SGI Origin 2000 (4 processors, 2 GB Memory, 1 TB storage)
• SGI Onyx 2 (24 processors, 25 GB Memory, 6 Infinite Reality-2 Graphics pipes)
• NMI components Globus and NWS installed on all systems save the
Cray SV1
Resources at UT Campus
• Individual clusters belonging to professors in
– engineering
– computer sciences
• NMI components Globus and NWS installed on several machines on campus
• Computer laboratories having hundreds of PCs in the engineering and computer sciences departments
Campus Grid Model
• “Hub and Spoke” Model
• Researchers build programs on their clusters
and migrate bigger jobs to TACC resources
– Use GSI for authentication
– Use GridFTP for data migration (see the staging sketch below)
– Use LSF MultiCluster for migration of jobs
• Reclaim unused computing cycles on PCs
through United Devices infrastructure.
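A minimal sketch of this migration path, assuming a Globus 2.2.x client environment; grid-proxy-init and globus-url-copy are the standard GSI and GridFTP command-line tools, while the host names and file paths here are hypothetical placeholders.

# Sketch: authenticate with GSI, then stage input data to TACC over GridFTP.
# Host names and paths are hypothetical placeholders.
import subprocess

# One-time GSI authentication (prompts for the user's key passphrase).
subprocess.run(["grid-proxy-init"], check=True)

# Migrate the input file from a campus cluster to a TACC system.
subprocess.run([
    "globus-url-copy",
    "gsiftp://cluster.dept.utexas.edu/home/user/input.dat",
    "gsiftp://hpc.tacc.utexas.edu/work/user/input.dat",
], check=True)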
UT Campus Grid Overview
[Diagram: campus clusters connected to TACC resources through LSF MultiCluster]
NMI Testbed Activities
• Globus 2.2.2 – GSI, GRAM, MDS, GridFTP
– Robust software
– Standard Grid middleware
– Need to install from source code to link to other
components like MPICH-G2, Simple CA
• Condor-G 6.4.4 – submits jobs using GRAM, monitors queues, receives notifications, and maintains Globus credentials (see the submit sketch below). Lacks
– the scheduling capability of Condor
– checkpointing
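A minimal sketch of routing a job through Condor-G to a Globus gatekeeper; the gatekeeper contact string, executable, and file names are hypothetical, and the globus universe / globusscheduler attributes shown should be checked against the local Condor-G 6.4.x installation.

# Sketch: write a Condor-G submit description and hand it to condor_submit,
# which forwards the job to the remote gatekeeper via GRAM.
# Gatekeeper contact and paths are hypothetical placeholders.
import subprocess

submit_description = """\
universe        = globus
globusscheduler = gatekeeper.tacc.utexas.edu/jobmanager-lsf
executable      = /home/user/bin/seismic
output          = seismic.out
error           = seismic.err
log             = seismic.log
queue
"""

with open("seismic.submit", "w") as f:
    f.write(submit_description)

# condor_submit hands the job to Condor-G for GRAM submission and monitoring.
subprocess.run(["condor_submit", "seismic.submit"], check=True)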
NMI Testbed Activities
• Network Weather Service 2.2.1
– name server for directory services
– memory server for storage of data
– sensors to gather performance measurements
– useful for predicting performance, which can be fed to a scheduler or “virtual grid”
• GSI-enabled OpenSSH 1.7
– a modified version of OpenSSH that allows logging in to remote systems and transferring files between them without entering a password (see the sketch below)
– requires replacing the native sshd with the GSI-enabled OpenSSH daemon
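A minimal sketch of password-free remote access once GSI-enabled OpenSSH is in place and a proxy exists; gsissh and gsiscp are the GSI-aware counterparts of ssh and scp, and the host names and paths are hypothetical placeholders.

# Sketch: remote command execution and file transfer with GSI credentials,
# with no password prompt. Host names and paths are hypothetical.
import subprocess

# Run a command on the remote system, authenticated by the GSI proxy.
subprocess.run(["gsissh", "cluster.dept.utexas.edu", "uname", "-a"], check=True)

# Copy a result file to the remote system the same way with gsiscp.
subprocess.run([
    "gsiscp",
    "results.tar.gz",
    "cluster.dept.utexas.edu:/home/user/results.tar.gz",
], check=True)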
Computations on the UT Grid
• Components used – GRAM, GSI, GridFTP, MPICH-G2
• Machines involved – Linux RH (2), Sun (2), Linux Debian
(2), Alpha Cluster (16 processors)
• Applications run – PI, Ring, Seismic (see the launch sketch below)
• Successfully ran a demo at SC02 using NMI R2
components
• Relevance to NMI
– must build from source to link to MPICH-G2
– should be easily configurable to submit jobs to schedulers like PBS, LSF, or LoadLeveler
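A minimal sketch of how a run such as Ring might be launched across two clusters with MPICH-G2, expressed as a DUROC multi-request RSL handed to globusrun; the contact strings, counts, and paths are hypothetical, and the exact RSL attributes should be checked against the MPICH-G2 documentation.

# Sketch: write a two-subjob RSL and launch it with globusrun, which
# co-allocates the subjobs through DUROC. All names are hypothetical.
import subprocess

rsl = """\
+
( &(resourceManagerContact="cluster1.utexas.edu/jobmanager-pbs")
   (count=2) (jobtype=mpi) (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
   (executable=/home/user/bin/ring) )
( &(resourceManagerContact="cluster2.utexas.edu/jobmanager-lsf")
   (count=2) (jobtype=mpi) (label="subjob 1")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1))
   (executable=/home/user/bin/ring) )
"""

with open("ring.rsl", "w") as f:
    f.write(rsl)

# Run interactively; globusrun returns when both subjobs have completed.
subprocess.run(["globusrun", "-f", "ring.rsl"], check=True)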
Computations on the UT Grid
• Issues to be addressed on clusters
– must submit to the local scheduler: PBS, LSF, or LoadLeveler
– compute nodes are on a private subnet and cannot communicate with compute nodes on another cluster
– must open ports through the firewall for communication (see the port-range sketch below)
– version incompatibility – affects source code that is linked to shared libraries
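For the firewall issue in particular, a common mitigation is to confine Globus data connections to a port range the firewall already permits, using the GLOBUS_TCP_PORT_RANGE environment variable; a minimal sketch follows, where the 40000-40100 range and host names are hypothetical and must match the site's firewall rules.

# Sketch: pin Globus ephemeral ports to a firewall-approved range before
# invoking a Globus client. The range and host names are hypothetical.
import os
import subprocess

env = dict(os.environ, GLOBUS_TCP_PORT_RANGE="40000,40100")

# Data connections for this transfer stay within the permitted port range.
subprocess.run([
    "globus-url-copy",
    "gsiftp://cluster1.utexas.edu/home/user/out.dat",
    "file:///tmp/out.dat",
], env=env, check=True)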
Grid Portals
• HotPage – web page to obtain information on
the status of grid resources
– NPACI HotPage (https://hotpage.npaci.edu)
– TIGRE Testbed portal (http://tigre.hipcat.net)
• Grid Technologies Employed
– Security: GSI, SSH, MyProxy for remote proxies
– Job Execution: GRAM Gatekeeper
– Information Services: MDS (GRIS + GIIS), NWS, custom information scripts (see the query sketch below)
– File Management: GridFTP
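A minimal sketch of how a portal backend might pull resource status from MDS for display; grid-info-search is the GT2 LDAP query client, the host name is hypothetical, and the search base shown is the stock "mds-vo-name=local, o=grid", which a site may override.

# Sketch: query a GT2 MDS (GRIS/GIIS) and print the LDIF a portal would
# parse into a status page. Host name is hypothetical; the search base and
# object class follow the GT2 defaults and may differ per site.
import subprocess

result = subprocess.run([
    "grid-info-search",
    "-x",                               # anonymous (non-GSI) bind
    "-h", "gatekeeper.tacc.utexas.edu", # GRIS/GIIS host to query
    "-b", "mds-vo-name=local, o=grid",  # default MDS search base
    "(objectclass=MdsHost)",            # per-host records
], capture_output=True, text=True, check=True)

print(result.stdout)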
GridPort 2.0 Multi-Application Arch.
(Using Globus as Middleware)
Future Work
• Use NMI components where possible in building
grids
• Use Lightweight Campus Certificate Policy for
instantiating a Certificate Authority at TACC
• Build portals and deploy applications on the UT
Grid
Collaborators
• Mary Thomas
• Dr. John Boisseau
• Rich Toscano
• Jeson Martajaya
• Eric Roberts
• Maytal Dahan
• Tom Urban