GRID Applications - Department of Computing

GRID Applications
Tuğba Taşkaya Temizel
20 February 2006
Problems Where Grids Have
Been Successful
 Megacomputing problems: The problems
are divided into independent parts that run in parallel.
 Mega and seamless access problems:
Integrated access to, and use of, multiple data
sources and resources.
 Loosely coupled nets: Functionally
decomposed sequential problems.
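A minimal sketch of the megacomputing pattern, assuming a workload that splits into chunks with no dependencies between them. The functions `split`, `work` and `run_independent` are hypothetical names for illustration; a thread pool stands in for the grid nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Divide items into roughly equal, independent chunks."""
    size = max(1, -(-len(items) // parts))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def work(chunk):
    """Stand-in for an expensive computation that needs no other chunk."""
    return sum(x * x for x in chunk)


def run_independent(items, parts=4):
    """Run each chunk separately, then combine the partial results."""
    chunks = split(items, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(work, chunks))
    return sum(partials)


print(run_independent(list(range(10))))  # sum of squares of 0..9
```

Because no chunk depends on another, the chunks could in principle be sent to any available machine and combined at the end.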
Grid Applications
 Community centric: Get the organisations
together for collaboration.
 Data-centric: Integration of multiple
resources
 Compute-centric: Certain coupled
applications and seamless access to
multiple back-end hosts
 Interaction-centric: Corresponds to
problems requiring real-time responses
Application Fields
 Astronomy
 Bioinformatics
 Environmental Science
 Particle physics
 Medicine and Health
 Social Sciences
 Combinatorial Chemistry
 ….
ASTRONOMY
Virtual Observatory
TOTAL BUDGET : US$ 20 million
DURATION : 2002-2005
TYPE : INTERNATIONAL
URL : http://www.ivoa.net
ASTRONOMY
Virtual Observatory
 Objective: To facilitate the international
coordination and collaboration necessary
for the development and deployment of
the tools, systems and organizational
structures necessary to enable the
international utilization of astronomical
archives as an integrated and
interoperating virtual observatory.
ASTRONOMY
Virtual Observatory
 Data creators:
   create the data and store it in an archive
   describe the process of data creation in standard modelling terms
   describe data products according to IVOA standards
   implement an automated publication and registration mechanism
 Data providers:
   enable web access to archives
   choose data products to be published
   register data products with IVOA
   support discovery/query services on data products
   support federation
 Service providers:
   implement data discovery/query/analysis/creation services
   enable web access to results of these services
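The register and discover steps above can be sketched as a toy in-memory registry. This is only an illustration under stated assumptions: `DataProduct`, `Registry`, `register` and `discover` are hypothetical names, not the IVOA interfaces, and a real registry would be a network service with a standard metadata schema.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical description of a published data product."""
    name: str
    provider: str
    keywords: set = field(default_factory=set)


class Registry:
    """Toy in-memory registry (assumption: not the IVOA API)."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        # Data provider step: register a chosen data product.
        self._products[product.name] = product

    def discover(self, keyword):
        # Discovery step: names of products tagged with the keyword.
        return sorted(p.name for p in self._products.values()
                      if keyword in p.keywords)


registry = Registry()
registry.register(DataProduct("survey-A", "archive-1", {"infrared", "images"}))
registry.register(DataProduct("catalog-B", "archive-2", {"optical"}))
print(registry.discover("infrared"))
```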
ASTRONOMY
Virtual Observatory
 Problems:
   One common data format structure: Translation
  mechanisms exist. Each data provider should
  advertise their data format. The HDF5 format
  has been proposed recently to overcome this
  difficulty.
   Query services: Basic queries (query for a specific data
  product) have been provided, but more complex
  queries are needed for theoretical results.
   Simulators: Algorithms that create new data from
  previously published data resources.
   Modelling/Describing Simulations: Correct classification
  of simulations (classification in terms of subject, type,
  implementation choice, data product).
ASTRONOMY
Virtual Sky
PARTNERS: Caltech Center for Advanced Computing Research,
Johns Hopkins University, the Sloan Sky Survey,
Microsoft Research
PORTED TO TERAGRID
URL : http://virtualsky.org
ASTRONOMY
Virtual Sky
 Provides seamless, federated images of the
night sky; not just an album of popular places,
but also the entire sky at multiple resolutions
and multiple wavelengths
 Federates many different image sources into a
unified interface
 Architecture is based on a hierarchy of
precomputed image tiles (a mosaic), so that
response is fast.
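One way to picture a hierarchy of precomputed tiles is a quadtree-style pyramid. The sketch below assumes a simple equirectangular layout in which level n holds 2**n x 2**n tiles; this layout and the function names are assumptions for illustration, not the tiling that Virtual Sky actually uses.

```python
def tile_index(ra_deg, dec_deg, level):
    """Return (column, row) of the tile containing a sky position.

    Assumes level n has 2**n x 2**n tiles spanning RA [0, 360)
    and Dec [-90, 90]; dec_deg must lie within that range.
    """
    n = 2 ** level
    col = int((ra_deg % 360.0) / 360.0 * n)
    row = int((dec_deg + 90.0) / 180.0 * n)
    # Clamp so that Dec = +90 and rounding at RA = 360 stay in range.
    return min(col, n - 1), min(row, n - 1)


def parent(col, row):
    """Tile one level coarser that covers the given tile."""
    return col // 2, row // 2


print(tile_index(180.0, 0.0, 1))
print(parent(*tile_index(359.9, 90.0, 2)))
```

Serving a view then reduces to looking up a handful of tiles at the right level instead of resampling raw images on request, which is why the response can be fast.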
ASTRONOMY
Virtual Sky
 Problem: Demand for high computational power
for resampling the raw images. For each pixel of
the image, several projections from pixel to sky
and the same number of inverse projections are
required.
 Problem: Federation of the heterogeneous
image resources causes a loss of information
ASTRONOMY
MONTAGE
Partners: California Institute of Technology,
NASA, Caltech University
Duration: 2002-2005
URL : http://montage.ipac.caltech.edu/
PORTED TO TERAGRID
ASTRONOMY
MONTAGE
 Comprehensive mosaicking
system that allows a broad choice
of resampling and photometric
algorithms
 Offers simultaneous, parallel
processing of multiple images to
enable fast, deep, robust source
detection in multi-wavelength
image space.
ASTRONOMY
MONTAGE
 Data fetched from the
most convenient place
 Computing is done at
any available platform
 Replica Management:
Intermediate products
are cached for reuse
 Virtual Data: User
specifies the desired
data using domain
specific attributes and
not by specifying how to
derive the data
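The replica management and virtual data ideas above can be sketched together: the user names the product by attributes, and the system derives it only when no cached copy exists. `VirtualDataCatalog` and `make_mosaic` are hypothetical names for illustration and do not reflect Montage's actual interfaces.

```python
class VirtualDataCatalog:
    """Toy catalog: users ask for data by attributes, not by recipe."""

    def __init__(self, derive):
        self._derive = derive   # function that builds a product on demand
        self._cache = {}        # intermediate products kept for reuse
        self.derivations = 0    # how many times derive() actually ran

    def request(self, **attributes):
        # The same attributes in any order map to the same cache key.
        key = tuple(sorted(attributes.items()))
        if key not in self._cache:
            self._cache[key] = self._derive(**attributes)
            self.derivations += 1
        return self._cache[key]


def make_mosaic(region, band):
    """Hypothetical stand-in for an expensive mosaicking job."""
    return f"mosaic of {region} in {band} band"


catalog = VirtualDataCatalog(make_mosaic)
first = catalog.request(region="M31", band="J")
second = catalog.request(band="J", region="M31")  # served from the cache
print(first, catalog.derivations)
```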
ASTRONOMY
QUEST
Partners: Yale University, Indiana University, Centro de
Investigaciones de Astronomía, Universidad de Los Andes
URL :
http://hepwww.physics.yale.edu/www_info/astro/quest.html
ASTRONOMY
QUEST
Objectives:
 Transient gravitational lensing: This will lead to a better
understanding of the nature of the non-luminous mass of
the Galaxy.
 Quasar gravitational lensing: At much larger scales than
our Galaxy, the Quest team hopes to detect strong
lensing of very remote objects such as quasars.
 Supernovae: The Quest system will be able to detect
large numbers of very distant supernovae, leading to
prompt follow-up observations, and a better
understanding of supernova classification, as well as
their role as standard candles for understanding the
early Universe.
 Gamma-ray burst (GRB) afterglows: Quest will search
for these fading sources, and try to correlate them with
known GRBs.
ASTRONOMY
QUEST
Architecture:
COMBINATORIAL CHEMISTRY
COMB-E-CHEM
 Partners: Southampton Chemistry
Department, Mathematics, ECS, Bristol
Chemistry, with backing from Pfizer, Roche and
IBM
 £2.2M project
 Started in 2001
 National e-Science Pilot project
 URL: http://www.combechem.org
COMBINATORIAL CHEMISTRY
COMB-E-CHEM
 Objective: Develop new ways of collaborative
working over the Grid to handle the hugely
increasing flow of information on molecular and
crystal structures arising from the application of
Combinatorial Chemistry.
 Facilitate the understanding of how molecular
structure influences the crystal and material
properties.
COMBINATORIAL CHEMISTRY
COMB-E-CHEM
HIGHER ENERGY PHYSICS
Goals
 Find the mechanism responsible for mass
in the universe, and the “Higgs” particles
associated with mass generation, as well
as the fundamental mechanism that led to
the predominance of matter over
antimatter in the observable cosmos.
HIGHER ENERGY PHYSICS
Challenges
 Providing rapid access to data subsets drawn
from massive data stores, rising from petabytes
in 2002 to ~100 petabytes by 2007, and exabytes
(10^18 bytes) by approximately 2012 to 2015.
 Providing secure, efficient, and transparent
managed access to heterogeneous worldwide-distributed
computing and data-handling
resources, across an ensemble of networks of
varying capability and reliability.
HIGHER ENERGY PHYSICS
Challenges
 Tracking the state and usage patterns of
computing and data resources in order to make
possible rapid turnaround as well as efficient
utilisation of global resources
 Providing the collaborative infrastructure that will
make it possible for physicists to contribute
effectively.
 Building regional, national, continental, and
transoceanic networks, with bandwidths rising
from the gigabit per second to the terabit per
second range over the next decade.
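To give a feel for why the bandwidth target matters, the sketch below estimates the ideal time to move one petabyte at the two ends of that range. It assumes decimal units and a perfectly utilised link with no protocol overhead, so real transfers would be slower.

```python
def transfer_time_seconds(size_bytes, bandwidth_bits_per_s):
    """Ideal transfer time, ignoring protocol overhead and contention."""
    return size_bytes * 8 / bandwidth_bits_per_s


PETABYTE = 10 ** 15  # decimal petabyte, in bytes
GIGABIT = 10 ** 9    # bits per second
TERABIT = 10 ** 12   # bits per second

days_at_gigabit = transfer_time_seconds(PETABYTE, GIGABIT) / 86400
hours_at_terabit = transfer_time_seconds(PETABYTE, TERABIT) / 3600
print(f"1 PB at 1 Gb/s: {days_at_gigabit:.1f} days")    # 92.6 days
print(f"1 PB at 1 Tb/s: {hours_at_terabit:.1f} hours")  # 2.2 hours
```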
HIGHER ENERGY PHYSICS
Grid projects
 PPDG (Particle Physics Data Grid)
 GriPhyN (Grid Physics Network)
 iVDGL (International Virtual Data Grid
Laboratory)
 DataGrid
 LCG (Large Hadron Collider
Computing Grid)
 CrossGrid
HIGHER ENERGY PHYSICS
PPDG (Particle Physics Data Grid)
 Formed in 1999
 Objective: To address the need for Data
Grid services to enable the worldwide-distributed
computing model of current and
future high-energy and nuclear physics
experiments.
 URL: www.ppdg.net
HIGHER ENERGY PHYSICS
GriPhyN (Grid Physics Network)
 Objective: Focused on the creation of
Petabyte Virtual Data Grids that meet the
data-intensive computational needs of a
diverse community of thousands of
scientists spread across the globe.
 URL: http://www.griphyn.org
HIGHER ENERGY PHYSICS
iVDGL (International Virtual Data Grid
Laboratory)
 The iVDGL is tasked with establishing and
utilizing an international Virtual-Data Grid
Laboratory (iVDGL) of unprecedented scale and
scope, comprising heterogeneous computing
and storage resources in the U.S., Europe and
ultimately other regions linked by high-speed
networks, and operated as a single system for
the purposes of interdisciplinary experimentation
in grid-enabled, data-intensive scientific
computing.
 URL: http://www.ivdgl.org/
HIGHER ENERGY PHYSICS
Goals
 Deploy a Grid laboratory
   Support research mission of data intensive experiments
   Provide computing and personnel resources at university sites
   Provide platform for computer science technology development
   Prototype and deploy a Grid Operations Center (iGOC)
 Integrate Grid software tools
   Into computing infrastructures of the experiments
 Support delivery of Grid technologies
   Hardening of the Virtual Data Toolkit (VDT) and other
  middleware technologies developed by GriPhyN and other Grid
  projects
 Education and Outreach
   Lead and collaborate with Education and Outreach efforts
   Provide tools and mechanisms for underrepresented groups and
  remote regions to participate in international science projects
HIGHER ENERGY PHYSICS
iVDGL Sites (February 2004)
[Map of sites, each marked as Tier1, Tier2 or Other]
 Sites: SKC, LBL, Caltech, Boston U, UW Milwaukee, Michigan, PSU,
UW Madison, BNL, Fermilab, Argonne, Iowa, Chicago, J. Hopkins,
Indiana, Hampton, ISI, Vanderbilt, UCSD, UF, Austin, Brownsville, FIU
 Partners: EU, Brazil, Korea
HIGHER ENERGY PHYSICS
DataGrid
 DataGrid is a project funded by the European Union.
 The objective is to build the next generation
computing infrastructure providing intensive
computation and analysis of shared large-scale
databases, from hundreds of terabytes to
petabytes, across widely distributed scientific
communities.
 URL: eu-datagrid.web.cern.ch
 Duration: 2001-2003
HIGHER ENERGY PHYSICS
LCG (Large Hadron Collider Computing
Grid)
 The aim is to prepare the computing
infrastructure for the simulation,
processing, and analysis of LHC data for
all four of the LHC collaborations.
 URL : http://lcgrid.web.cern.ch
HIGHER ENERGY PHYSICS
LHC Data Grid Hierarchy (CMS Experiment, Global)
[Figure: tiered data grid. The Online System feeds Tier 0 (CERN Computer
Center) at 0.1 - 1.5 GBytes/s. Tier 1: national centers (Korea, Russia, UK,
USA). Tier 2: Tier2 Centers. Tier 3: Institutes, physics caches. Tier 4: PCs.
Link speeds shown between tiers: 10-40 Gb/s, 2.5-10 Gb/s, 1-2.5 Gb/s,
1-10 Gb/s.]
~10s of Petabytes/yr by 2007-8
~1000 Petabytes in < 10 yrs?
HIGHER ENERGY PHYSICS
CrossGrid
 Objective: Developing, implementing, and
exploiting new Grid components for interactive
compute- and data-intensive applications such
as simulation and visualization for surgical
procedures, flooding crisis team decision-support
systems, distributed data analysis in
high-energy physics, and air pollution combined
with weather forecasting.
 URL: www.crossgrid.org
HIGHER ENERGY PHYSICS
The CrossGrid architecture
[Figure: layered architecture diagram.
Layers: Applications; Supporting Tools; Applications Development Support;
App. Spec Services; Generic Services; Fabric.
Components shown: 1.1 BioMed; 1.2 Flooding; 1.3 Interactive Distributed
Data Access; 1.3 Data Mining on Grid (NN); 1.4 Meteo Pollution;
2.2 MPI Verification; 2.3 Metrics and Benchmarks; 2.4 Performance Analysis;
3.1 Portal & Migrating Desktop; MPICH-G; 1.1, 1.2 HLA and others;
1.1 User Interaction Services; 1.3 Interactive Session Services;
1.1 Grid Visualisation Kernel; 3.2 Scheduling Agents; DataGrid Job
Submission Service; 3.4 Optimization of Grid Data Access; 3.3 Grid
Monitoring; 3.1 Roaming Access; 3.4 Optimization of Local Data Access;
GRAM; GridFTP; GIS / MDS; GSI; Globus-IO; DataGrid Replica Manager;
Globus Replica Manager; Replica Catalog; Resource Manager (CE), CPU;
Resource Manager (SE), Secondary Storage; Resource Manager, Tertiary
Storage; Resource Manager, Instruments (Satellites, Radars).]
BIOINFORMATICS
Challenges
 To provide a usable and accessible
computational and data management
environment
 To provide sufficient support services
 To ensure that the science performed on the grid
constitutes the next generation of advances
 To accept feedback from bioinformaticians and
to improve the next generation of infrastructure
BIOINFORMATICS
Grid Applications
 CEPAR (Combinatorial Extension in
PARallel) and CEPort: 3D protein
structure comparison
 Chemport: a quantum mechanical
biomedical framework
BIOINFORMATICS
CEPAR: a computational biology application
 A typical protein consists of about 300 amino acids,
each one of 20 types, giving a total of 20^300 possible sequences.
 With 30,000 protein chains in the PDB (Protein Data
Bank) and each pairwise comparison taking 30 s, comparing
all pairs takes (30k * 30k / 2) * 30 s, or about 428 CPU years
on one processor.
 Strategy: data reduction, data optimization and
efficient scheduling. Using the CE (Combinatorial
Extension) algorithm on 1000 CPUs of the 1.7 Teraflop
IBM Blue Horizon, the problem was solved in a few days.
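The CPU-year estimate can be checked with a few lines of arithmetic. The sketch uses the slide's own figures and its n * n / 2 approximation of the number of pairs. The last line assumes perfect scaling across 1000 CPUs, which is an idealisation.

```python
chains = 30_000
seconds_per_pair = 30

pairs = chains * chains // 2              # slide's approximation of n*(n-1)/2
total_seconds = pairs * seconds_per_pair
cpu_years = total_seconds / (365 * 24 * 3600)
days_on_1000_cpus = total_seconds / 1000 / 86400  # assumes perfect scaling

print(f"{pairs:,} pairs, about {cpu_years:.0f} CPU years")
print(f"brute force on 1000 CPUs: about {days_on_1000_cpus:.0f} days")
```

Brute force on 1000 CPUs would still take roughly 156 days under these assumptions, which suggests that the data reduction and optimization steps account for much of the reduction to a few days.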
BIOINFORMATICS
Chemport: a computational chemistry framework
 Chemistry computation with the General Atomic
and Molecular Electronic Structure System (GAMESS)
 Computational and functional analysis of
biomolecules via classical and quantum
mechanical simulation
BIOINFORMATICS
eDiaMoND
 A Grid-enabled federated database of
annotated mammograms
 eDiaMoND is a collaborative project
funded through an EPSRC grant and
IBM's SUR grant
 URL : www.ediamond.ox.ac.uk
BIOINFORMATICS
eDiaMoND goals
 It has a significantly large distributed database
of mammograms (400 cases per site with a
majority annotated).
 It aligns with and complies with new IT policies
for the NHS in that it is secure and wins the
confidence of the relevant legal, ethical and
NHS Trust IT officers. In addition, the system will
follow all known guidelines for the deployment of
NHS patient and health records.
 It is scalable and is designed in such a way that
it could scale to cope conceptually with millions
of images spread around the 90+ Breast Care
Units in the UK.
BIOINFORMATICS
eDiaMoND goals
 It is effective in that it is fast, it is useful to the
clinicians in the areas of screening, training,
epidemiology and computer aided detection, and
it is intuitive for the users.
 It must be built such that upgrades of platform or
image analysis software are graceful.
 It is reusable, in that the platform could be used
as a foundation for other e-health projects.
 It is based on Grid architecture.
Grid Applications
What new challenges do these applications
represent?
• Are there new paradigms and problems here?
Case Study: News Service
Application
 Problem:
   The underlying application is to be used by
  a News Service organization whose purpose is
  to electronically publish news bulletins
  to various subscribers. The News
  Service organization publishes bulletins
  within various categories, such as
  Business News, Sports, and Weather.
Case Study: News Service
Application
 Tasks:
   Writers gather news and submit news bulletins for approval
  via this application.
   Editors are informed of any pending bulletins that the writers
  have submitted. The editors log on to the application, are
  authenticated by the application and retrieve the pending bulletins.
  Upon review, they either approve or
  reject the bulletins submitted by the writers. All
  approved bulletins are subsequently published by the
  application to all registered subscribers.
   The administrator is responsible for starting and stopping the
  application and performing other necessary administrative
  functions.
   The News Service organization allows business partner organizations
  to submit news bulletins. Upon receipt of bulletins from the
  business partner organizations, the administrator loads them
  into the application for review by the editors and
  publishing to the subscribers.
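The tasks above can be sketched as a small submit / review / publish workflow. This is a toy in-memory model under stated assumptions: the class `NewsService` and all of its method names are hypothetical, authentication is reduced to a name check, and publishing means appending to a subscriber's list.

```python
class NewsService:
    """Toy model of the submit / review / publish workflow."""

    def __init__(self, editors):
        self._editors = set(editors)  # names allowed to review bulletins
        self._pending = []            # bulletins awaiting a decision
        self._subscribers = {}        # subscriber name -> received bulletins

    def subscribe(self, name):
        self._subscribers[name] = []

    def submit(self, category, text):
        # Used by writers, and by the administrator for partner bulletins.
        self._pending.append((category, text))

    def pending(self, editor):
        self._authenticate(editor)
        return list(self._pending)

    def review(self, editor, bulletin, approve):
        self._authenticate(editor)
        self._pending.remove(bulletin)
        if approve:  # approved bulletins go to every registered subscriber
            for inbox in self._subscribers.values():
                inbox.append(bulletin)

    def inbox(self, name):
        return list(self._subscribers[name])

    def _authenticate(self, editor):
        if editor not in self._editors:
            raise PermissionError(f"{editor} is not an editor")


service = NewsService(editors=["editor1"])
service.subscribe("reader1")
service.submit("Sports", "Match report")
service.submit("Weather", "Unverified storm rumour")
first, second = service.pending("editor1")
service.review("editor1", first, approve=True)
service.review("editor1", second, approve=False)
print(service.inbox("reader1"))
```

Only the approved bulletin reaches the subscriber; the rejected one is dropped from the pending list.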
Case Study: News Service
Application
 The System Context:
Case Study: News Service
Application
 The use cases:
Case Study: News Service
Application
 The architecture overview: