
XSEDE '14, July 13-18, 2014, Atlanta, GA, USA
Evaluating Distributed Platforms for Protein-Guided Scientific Workflow
Natasha Pavlovikj, Kevin Begcy, Sairam Behera, Malachy Campbell, Harkamal Walia, Jitender S. Deogun
University of Nebraska-Lincoln
Introduction
• Gene expression and transcriptome analysis are a major research focus for many biologists and other scientists
• Analyzing this so-called "big data" requires a complex multitude of software tools
• This drives a growing demand for powerful computational resources where the data can be stored and analyzed
Assembly Pipeline
• Assembly of raw sequence data is a complex multi-stage process composed of preprocessing, assembling, and post-processing steps
• An assembly pipeline simplifies the entire assembly process by automating these steps
blast2cap3
• Using multiple approaches to assemble the filtered reads produces highly redundant transcripts
• The overlap-based assembly program CAP3 merges transcripts whose overlapping regions meet a specified identity threshold
• However, because most of the produced transcripts code for proteins, protein similarity should also be considered during merging
blast2cap3
• Blast2cap3 is a protein-guided assembly approach that first clusters transcripts based on similarity to a common protein and then passes each cluster to CAP3
• Blast2cap3 is a Python script written by Vince Buffalo of the Plant Sciences Department, UCD
• Recent use of blast2cap3 on a wheat transcriptome assembly shows that it generates fewer artificially fused sequences and reduces the total number of transcripts by 8-9%
blast2cap3
• The assembled transcripts are aligned against protein datasets from organisms closely related to the one being assembled; transcripts sharing a common protein hit are then merged using CAP3 (a sketch of this idea follows below)
• The current implementation of blast2cap3 supports only serial execution
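
A minimal sketch of the protein-guided merging idea, for illustration only: this is not Vince Buffalo's actual script, and the tabular BLAST output format, the best-hit handling, and the CAP3 invocation are assumptions.

    # Illustrative sketch of protein-guided clustering followed by CAP3.
    # Assumes BLAST tabular output (query ID and protein hit ID in the
    # first two columns) and a FASTA of assembled transcripts; the exact
    # formats are assumptions, and CAP3 is assumed to be on PATH.
    import subprocess
    from collections import defaultdict
    from Bio import SeqIO

    transcripts = SeqIO.to_dict(SeqIO.parse("transcripts.fasta", "fasta"))

    # Group transcript IDs by the protein they hit.
    clusters = defaultdict(set)
    with open("alignments.out") as blast:
        for line in blast:
            query, protein = line.split("\t")[:2]
            clusters[protein].add(query)

    # Write each multi-member cluster to its own FASTA and let CAP3 merge it.
    for i, members in enumerate(clusters.values()):
        if len(members) < 2:
            continue  # a lone transcript has nothing to merge with
        cluster_file = "cluster_%d.fasta" % i
        SeqIO.write((transcripts[m] for m in members), cluster_file, "fasta")
        subprocess.check_call(["cap3", cluster_file])

Because each cluster can be merged independently, this grouping is what makes the work divisible into parallel tasks.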
Pegasus Workflow Management System
• The modularity of blast2cap3 allows us to decompose the existing approach into multiple tasks, some of which can run in parallel
• The protein-guided assembly can thus be structured as a scientific workflow
Pegasus Workflow Management System
• Pegasus WMS is a framework that automatically maps high-level scientific workflows, organized as directed acyclic graphs (DAGs), onto a wide range of execution platforms, including clusters, grids, and clouds
• Pegasus uses DAX (directed acyclic graph in XML) files to specify an abstract workflow
• The abstract workflow describes all executable files and the logical names of the input files used by the workflow (see the sketch below)
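
A minimal sketch of how such an abstract workflow can be composed with the Pegasus 4.x Python DAX API (Pegasus.DAX3); the job names, file names, and chunk count here are hypothetical, not the actual blast2cap3 workflow definition.

    # Minimal sketch of an abstract workflow built with the Pegasus 4.x
    # Python DAX API (Pegasus.DAX3). Job and file names are hypothetical.
    from Pegasus.DAX3 import ADAG, File, Job, Link

    dax = ADAG("blast2cap3")

    # Setup task: fetch and unpack the archive of prebuilt tools
    # (Python, Biopython, CAP3) on the execution site.
    setup = Job(name="extract_tools")
    setup.uses(File("tools.tar.gz"), link=Link.INPUT)
    dax.addJob(setup)

    # One CAP3 merge task per chunk of clusters; each depends on setup.
    for i in range(10):  # 10 chunks, purely illustrative
        chunk = File("chunk_%d.fasta" % i)
        merged = File("chunk_%d.merged.fasta" % i)
        cap3_job = Job(name="run_cap3")
        cap3_job.addArguments(chunk)
        cap3_job.uses(chunk, link=Link.INPUT)
        cap3_job.uses(merged, link=Link.OUTPUT)
        dax.addJob(cap3_job)
        dax.depends(parent=setup, child=cap3_job)

    # Serialize the abstract workflow to a DAX (XML) file for planning.
    with open("blast2cap3.dax", "w") as f:
        dax.writeXML(f)

The pegasus-plan command then maps the abstract DAX onto a concrete execution platform.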
blast2cap3 with Pegasus WMS
• In the workflow DAG (see the figure below), each node represents a workflow task, while each edge represents a dependency between tasks
• An archive holds all required prebuilt libraries and tools (Python, Biopython, CAP3)
• Downloading and extracting this archive is itself defined as a task in the workflow
• The Pegasus WMS implementation of blast2cap3 reduces the running time of the current serial implementation by more than 95%
[Figure: DAG of the blast2cap3 Pegasus workflow]
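
A purely illustrative sketch of how the parallel decomposition could be organized (the workflow's actual splitting logic may differ): distribute the per-protein cluster files round-robin across a fixed number of CAP3 tasks.

    # Illustrative sketch: spread cluster files across a fixed number of
    # workflow tasks (file names and counts are hypothetical).
    def partition(cluster_files, n_tasks):
        chunks = [[] for _ in range(n_tasks)]
        for i, path in enumerate(cluster_files):
            chunks[i % n_tasks].append(path)
        return chunks

    # Example: spread 1,000 hypothetical cluster files over 210 tasks.
    chunks = partition(["cluster_%d.fasta" % i for i in range(1000)], 210)

Varying the number of tasks in this way yields workflows of different sizes, such as the 30- to 2,010-task workflows examined in the experiments below.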
Execution Platforms
• The resources that scientific workflows require can exceed the capacity of local computational resources
• Scientific workflows are therefore usually executed on distributed platforms, such as campus clusters, grids, or clouds
• Execution platforms used here: Sandhills, the Open Science Grid (OSG), and Amazon EC2
Sandhills: University of Nebraska Campus Cluster
• Sandhills is one of the High Performance Computing (HPC) clusters at the University of Nebraska-Lincoln Holland Computing Center (HCC)
• It is used by faculty and students
• Sandhills was constructed in 2011 and has 1,440 AMD cores housed in a total of 44 nodes
• Every new HCC user account must be associated with a faculty or research group
OSG: Open Science Grid
• OSG is a national consortium of geographically distributed academic institutions and laboratories that provide hundreds of computing and storage resources to OSG users
• OSG is organized into Virtual Organizations (VOs)
• OSG does not itself own any computing or storage resources; instead, users use the resources contributed by other OSG members and VOs
• Every new user applies for an OSG certificate
Amazon EC2: Amazon Elastic Compute Cloud
• Amazon Elastic Compute Cloud (Amazon EC2) is a large commercial web-based service provided by Amazon.com
• Users have access to virtual machine (VM) instances, on which they deploy VM images with customized software and libraries
• Amazon EC2 is a scalable, elastic, and flexible platform
• Amazon EC2 users are billed hourly for the number and type of resources they use
Experiments
• Investigate the behavior of the modified Pegasus WMS implementation of blast2cap3 when the workflow is composed of 30, 110, 210, 610, 1,010, and 2,010 tasks, respectively
• Run each workflow multiple times on the different execution platforms to capture differences in workflow performance and in resource availability over time
Experiments
• Compare the total workflow running time between the different execution platforms
• Examine the number of running versus the number of idle jobs over time for each workflow
Experimental Data
• Diploid wheat Triticum urartu dataset from NCBI
• The assembled transcripts were generated using Velvet as a de novo assembler
• These transcripts were aligned against proteins from organisms closely related to wheat (barley, Brachypodium, rice, maize, sorghum, Arabidopsis)
• "transcripts.fasta": 404 MB, 236,529 assembled transcripts
• "alignments.out": 155 MB, 1,717,454 protein hits
Comparing Running Time on Sandhills, OSG, and Amazon EC2 for Workflows with Different Numbers of Tasks
[Figure: total workflow running time per platform and workflow size]
Comparing the Number of Running Jobs versus the Number of Idle Jobs Over Time for Workflows with Different Task Counts
[Figures, six slides: running versus idle jobs over time, one per workflow size]
Cost Comparison of Different Execution Platforms
• The most important difference between the commercial cloud and the academic distributed resources is cost
• Sandhills: generally free resources
• OSG: completely free resources
• Amazon EC2: complex pricing model; 50 m1.large spot instances at $0.04 per instance-hour, for a total cost of $122.84 (see the estimate below)
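
As a back-of-the-envelope reading of these figures (our inference, not stated on the slide): at $0.04 per instance-hour, 50 spot instances cost $2.00 per hour of wall time, so the $122.84 total corresponds to roughly 61 hours of 50-instance usage across all runs.

    # Back-of-the-envelope check of the reported EC2 spot cost
    # (an inference from the slide's figures, not an official breakdown).
    rate = 0.04       # $ per m1.large spot instance-hour
    instances = 50
    total = 122.84    # reported total cost in $

    cost_per_hour = instances * rate        # $2.00 per hour of wall time
    implied_hours = total / cost_per_hour   # about 61.4 hours across all runs
    print("$%.2f/hour -> %.1f implied hours" % (cost_per_hour, implied_hours))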
Conclusion
• Using more than 100 tasks in a workflow significantly reduces the running time on all execution platforms
• Resource allocation on Sandhills and OSG is opportunistic, and resource availability changes over time
• The results are almost constant when Amazon EC2 is used
• No workflow failures were encountered on Sandhills or Amazon EC2
Conclusion
• The predictability of the Amazon EC2 resources leads to better workflow running times when the cloud is used as an execution platform
• For our blast2cap3 workflow, better running time and better usage of the allocated resources were achieved when Amazon EC2 was used
• Given the cost of Amazon EC2, the academic distributed systems can be a good alternative
Acknowledgments
• University of Nebraska Holland Computing Center
• Open Science Grid