PhyloGrid

E-science grid facility for
Europe and Latin America
Computational challenges on
Grid Computing for workflows
applied to Phylogeny
R. Isea¹, E. Montes², A. J. Rubio-Montero² and R. Mayo²
¹ Fundación IDEA (Venezuela)
² CIEMAT (Spain)
IWPACBB 2009
Salamanca, June 12th, 2009
www.eu-eela.eu
Outline
• Phylogenetics: a reminder
• Challenges in Phylogenetics
– Computational methods: MrBayes
– Exploiting Grid technology
• MrBayes and Bioinformatics resources on Grid
• The PhyloGrid approach
– General description and objectives
– Taverna workflow
– GridSphere portal
– Future work: GridWay metascheduler
• Some results: HPV case study
• Summary and conclusions
Phylogenetics: a reminder
• Phylogeny: reconstruction of the
evolutionary history (evolutionary
tree) of organisms
– Influence and relationship between
species
– Evolution of selected populations
In July 1837 Darwin drew
his first known sketch of an
evolutionary tree
• Applications on Life Sciences,
Industry, etc:
– Knowing the real history of evolution: Tree of Life
– Drug discovery
– Tracing geographical origin, dating
introduction of strains
– Prediction of gene and protein function
– Epidemiological studies
Complete Tree of life
Computational problem: so many trees…
Nº of possible labelled topologies with n species or taxa:
Rooted trees: $N_R(n) = \prod_{k=1}^{n-1} (2k - 1)$
Unrooted trees: $N_U(n) = N_R(n - 1)$
[Table: Nº of taxa, Nº of rooted trees, Nº of unrooted trees]
Exhaustive enumeration
of all possible phylogenies
is not computationally
feasible
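To get a feel for how fast these counts grow, the two formulas on this slide can be evaluated directly. The following is a minimal sketch; the function names `rooted_trees` and `unrooted_trees` are illustrative and do not come from the slides.

```python
from math import prod


def rooted_trees(n: int) -> int:
    """Number of rooted labelled binary topologies for n taxa: product of (2k - 1), k = 1..n-1."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return prod(2 * k - 1 for k in range(1, n))


def unrooted_trees(n: int) -> int:
    """Number of unrooted labelled binary topologies for n taxa: N_U(n) = N_R(n - 1)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return rooted_trees(n - 1)


if __name__ == "__main__":
    # Print a small table in the spirit of the slide: taxa, rooted, unrooted.
    for n in (3, 4, 5, 10, 20, 50):
        print(n, rooted_trees(n), unrooted_trees(n))
```

Already at 10 taxa there are more than 34 million rooted topologies, which is why tree-search methods sample the space instead of enumerating it.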
Computational methods
• Phenetics: no evolutionary model
– Distance-matrix based methods (Neighbour-Joining)
• Cladistics:
– Maximum Parsimony (not statistically consistent)
– Maximum Likelihood
– Bayesian inference (Markov Chain Monte Carlo): simulation
techniques for approximating posterior probability distribution of
trees
• MrBayes (http://mrbayes.csit.fsu.edu)
– Sequential and Parallel implementations (MPI enabled)
– High CPU and memory consumption:
▪ 50 taxa: simulating 250,000 generations takes ~50 hours on a 2.8 GHz Pentium 4
▪ 2900 sequences of HIV-1: a computational challenge
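The idea of approximating a posterior distribution over trees by MCMC can be illustrated with a toy Metropolis sampler. This is only a sketch under strong assumptions: the three topologies and their unnormalised posterior weights are invented for illustration, and the sampler is plain Metropolis with a uniform proposal, not the algorithm implemented in MrBayes.

```python
import random
from collections import Counter


def metropolis_sample(weights, n_steps, seed=0):
    """Estimate posterior probabilities of discrete states by Metropolis sampling.

    weights maps each state to an unnormalised posterior weight (hypothetical values).
    """
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    visits = Counter()
    for _ in range(n_steps):
        proposal = rng.choice(states)  # uniform, hence symmetric, proposal
        # Accept with probability min(1, w(proposal) / w(current)).
        if rng.random() < weights[proposal] / weights[current]:
            current = proposal
        visits[current] += 1
    return {state: visits[state] / n_steps for state in states}


# Hypothetical weights for the three unrooted topologies of four taxa.
toy_weights = {
    "((A,B),(C,D))": 6.0,
    "((A,C),(B,D))": 3.0,
    "((A,D),(B,C))": 1.0,
}
estimate = metropolis_sample(toy_weights, n_steps=100_000)
```

The visit frequencies in `estimate` should approach the normalised weights (0.6, 0.3, 0.1) as the number of steps grows. In a real analysis the weights come from a likelihood and a prior evaluated on sequence data, and the state space is far too large to list.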
Challenges for Bioinformatics
• Still a computational problem
– Part of the scientific community: inefficient local facilities
– Rise in provision of HPC facilities: additional skills required
• A different approach is needed to access computing infrastructures
irrespective of their location: Grid Computing
Why Grid Computing?
• Grids represent a powerful new tool for e-Science
– Provide seamless sharing of computing and storage resources
– Enable the creation of scalable VOs: Biomed VO
– Service Grids (EGEE, EELA) and Opportunistic Grids
• Benefit for applications demanding non-trivial computing
capabilities
• Local and remote computing and storage facilities
Bioinformatics Grid resources
• Wide range of Bioinformatics resources through Web Interfaces:
– Projects of public databases (genomes, proteins, etc.):
▪ EMBL-EBI (UK), NCBI (USA), DDBJ and PDBJ (Japan), etc.
– Web services for Bioinformatics toolkits:
▪ EBI web services, NCBI Entrez Utils, DDBJ, BioMoby services
– Bioinformatics Web services Index/registry servers:
▪ EMBRACE service registry (BioCatalogue), BioMoby Central Registry
• Grid-enabled software packages:
– EELA-2: grEMBOSS (UNAM)
• Grid portals to mask applications
– Genius, GridSphere
• Grid infrastructures & VOs
– EGEE related: Biomed, GENE, EELA-prod VOs
– myGrid, caBIG, TeraGrid.
How to access MrBayes on Grid
• Simply sending a standard job to a site
– Software must be preinstalled at the sites
– Successfully tested in several projects
▪ National Grid Service (UK)
▪ FIRB LIBI “International Laboratory for Bioinformatics” project (Italy)
▪ BioinfoGRID project
▪ EELA: MPI version installed and tested at the EELA-CIEMAT site
– Supported by EELA-2/EGEE sites
• Grid bureaucracy: certificates, VOs, etc.
– Biologists are usually not advanced Grid users
• Need for friendly interfaces to Grid facilities
PhyloGrid aim
Offer the scientific community an easy interface for
calculating phylogenies on the Grid, without requiring
users to know the underlying computational procedure:
– Based on the MPI-enabled version of MrBayes
▪ By means of a Taverna workflow
– Takes advantage of the computational power of current Grid
infrastructures
The use of Taverna Workflows:
– Allows multiple database selection
– Extendable with access to complementary tools (ClustalW-MPI) or
other workflows (myExperiment repository)
PhyloGrid architecture
[Figure: PhyloGrid architecture. Labels: Portal, Certificate, HTTPS; GridSphere Portal + WF Enactor/Engine; SOAP; gLite UI + Submission WS; GRID protocols; gLite GRID with WMS, CE, WNs, SE, LFC Catalog, RLS]
Taverna Workflow Mgmt. System
• A bioinformatician could
easily implement Grid
Workflows without Grid
skills
• Public workflow
repository (myExperiment)
• Several Plugins to use WS
– myGrid, caBIG, GridSAM, BioMoby
– Many public databases
– GT4 services and gRAVI developer framework
• Many tools/plugins
– Manipulating files, format converter, local and remote execution,
visualization applets, tools for accessing WS
PhyloGrid Workflow for MrBayes
• Input params received from
GridSphere portal
• Converts ALN/ClustalW, PHYLIP and
MSA input to NEXUS format
• Builds NEXUS file for
MrBayes
• Creates JDL file
• Job submission
• Nested workflow checks
Grid job execution
• Gets output from SE
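The two file-building steps of the workflow (NEXUS input for MrBayes, then a JDL job description) can be sketched as follows. This is an illustrative sketch, not the PhyloGrid implementation: the function names, the sample sequences, the MrBayes settings, the executable name `mb`, the output file names, and the choice of `JobType = "MPICH"` with `NodeNumber` are all assumptions, and a real MPI job would normally go through a site-specific wrapper script.

```python
def to_nexus(alignment, ngen=250000, samplefreq=100):
    """Build a NEXUS file (data block plus MrBayes block) from aligned DNA sequences.

    alignment maps taxon names to aligned sequences of equal length.
    """
    names = list(alignment)
    nchar = len(alignment[names[0]])
    if any(len(seq) != nchar for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(names)} nchar={nchar};",
        "  format datatype=dna missing=? gap=-;",
        "  matrix",
    ]
    lines += [f"    {name}  {alignment[name]}" for name in names]
    lines += [
        "  ;",
        "end;",
        "begin mrbayes;",
        "  set autoclose=yes nowarn=yes;",  # run without interactive prompts
        "  lset nst=6 rates=gamma;",        # assumed substitution model settings
        f"  mcmc ngen={ngen} samplefreq={samplefreq};",
        "  sump;",
        "  sumt;",
        "end;",
    ]
    return "\n".join(lines) + "\n"


def build_jdl(nexus_name, nodes=4):
    """Build a gLite-style JDL description for an MPI run (attribute values are assumptions)."""
    return "\n".join([
        "[",
        '  JobType = "MPICH";',
        f"  NodeNumber = {nodes};",
        '  Executable = "mb";',
        f'  Arguments = "{nexus_name}";',
        '  StdOutput = "std.out";',
        '  StdError = "std.err";',
        f'  InputSandbox = {{"{nexus_name}"}};',
        '  OutputSandbox = {"std.out", "std.err"};',
        "]",
    ]) + "\n"


# Toy input: two hypothetical, very short aligned sequences.
nexus_text = to_nexus({"HPV16": "ATGC", "HPV18": "ATGA"})
jdl_text = build_jdl("hpv.nex")
```

In the workflow described on this slide, the generated JDL would then be handed to the job submission step, and the nested workflow would poll the job until its output can be fetched from the SE.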
GridSphere portal
• PhyloGrid web portal built on top of GridSphere portal
framework (http://www.gridsphere.org):
– A Grid portal improves usability of Grids
▪ Hiding complexity of technology involved
– A Grid portal improves utilization of Grids
▪ Providing an appealing user-friendly Web Interface
▪ Enforcing Grid utilization policies
• PKI security, etc.
Cohesive Grid portals
[Figure: snapshot of the virtual work area of the PhyloGrid Portal with some results]
Future work: GridWay
• The JDL job approach
– Hard to handle job errors within a Taverna workflow
– gLite plugin for Taverna is under development
▪ Taverna must be installed on a UI, or
▪ Remote execution to a UI must be used (Taverna remote workflow enactor)
• GridWay metascheduler
– Characteristics
▪ Fully compatible with gLite-based Grids (EELA-2, EGEE)
▪ Better resource selection based on internal statistics
▪ Automatic migration and rescheduling of failed jobs
▪ Checkpointing management for long-running tasks
– Taverna binding implementation:
▪ WS GRAM interface deployed over GridWay
▪ By means of GT4 plugins or directly implementing a JSDL plugin
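The kind of error handling that is hard to express inside a workflow, and that a metascheduler takes over, can be pictured as a resubmission loop. The sketch below is a generic illustration only: it does not use the GridWay or gLite APIs, and `submit`, `wait_for`, the site names and the job states are hypothetical stand-ins.

```python
def run_with_rescheduling(submit, wait_for, sites, max_attempts=3):
    """Submit a job and move it to the next site after each failure.

    submit(site) returns a job id; wait_for(job_id) blocks until the job
    ends and returns "DONE" or "FAILED". Both callables are hypothetical.
    """
    if not sites:
        raise ValueError("at least one site is required")
    for attempt in range(1, max_attempts + 1):
        site = sites[(attempt - 1) % len(sites)]  # rotate through candidate sites
        job_id = submit(site)
        if wait_for(job_id) == "DONE":
            return job_id, attempt
    raise RuntimeError(f"job failed after {max_attempts} attempts")


# Fake backend for demonstration: every job sent to "site-a" fails.
def fake_submit(site):
    return f"job@{site}"


def fake_wait_for(job_id):
    return "FAILED" if job_id.endswith("site-a") else "DONE"


result = run_with_rescheduling(fake_submit, fake_wait_for, ["site-a", "site-b"])
```

Here the first attempt fails on `site-a` and the job completes on the second attempt at `site-b`. A real metascheduler would additionally rank sites from its own statistics and resume from checkpoints.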
HPV case study with PhyloGrid
• HPV is a recognized underlying factor in cervical cancer:
– 90% of cases show infection by some HPV strain
• Complete HPV nucleotide sequences are about 8000 bases long:
– E1, E2, E4-E7 early expression and L1, L2 late expression genes
– HPV classification according to L1 variability (> 100 types)
– Two different categories with respect to oncogenic potential
• Study: check if this categorization really fits the evolutionary
history of HPV
– 121 HPV sequences
– Molecular phylogenetic calculations
for L1, L2 and E7 genes
Results obtained with PhyloGrid
Molecular Phylogeny of HPV in oncogenes from L1, L2, E7
• 121 HPV nucleotide sequences of L1 (the major capsid gene)
• Phylogenetic tree for L1
• Thicker lines mark differences between this tree and the tree derived from the L2 gene
• Topology similarity score of 85% between L1 and L2
Conflict with HPV classification based on variability of L1 gene
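The slides do not say how the topology similarity score was computed. One simple way to compare two trees over the same taxa is to count the clades they share, a measure in the spirit of the Robinson-Foulds distance. The sketch below assumes rooted trees written as nested tuples, and the five-taxon example trees are invented for illustration.

```python
def clades(tree):
    """Return (all leaves, set of non-trivial clades) for a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return all_leaves, found


def clade_similarity(tree_a, tree_b):
    """Fraction of clades shared by two rooted trees over the same taxa (1.0 = identical)."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must contain the same taxa")
    total = len(clades_a) + len(clades_b)
    if total == 0:
        return 1.0
    return 2 * len(clades_a & clades_b) / total


# Hypothetical trees that differ only in how A, B and C are grouped.
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
similarity = clade_similarity(tree_1, tree_2)
```

The two example trees share the clades {A, B, C} and {D, E} but disagree on the innermost pair, so two of the three clades in each tree match.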
Summary and conclusions
• PhyloGrid is a tool for Phylogenetic studies on Grid by
means of MPI-enabled MrBayes:
– Friendly interface (GridSphere portal): no computational or grid
skills required to perform calculations.
– Automation of tasks: Taverna workflow
• PhyloGrid takes advantage of the computational power
of current Grid infrastructures
– Allowing phylogenetic analysis on a large scale
– Reducing the technological divide that part of the scientific
community faces in accessing computational platforms such as
the Grid
Thanks for your attention
Questions?
Contact
R. Isea¹: raul.isea at gmail.com
E. Montes²: esther.montes at ciemat.es
A. J. Rubio-Montero²: antonio.rubio at ciemat.es
R. Mayo²: rafael.mayo at ciemat.es
http://www.ciemat.es/portal.do?IDR=1481&TR=C
¹ Fundación IDEA (Venezuela)
² CIEMAT (Spain)