The Cloud for Biologists - using bioinformatics tools

Mattias de Hollander
Netherlands Institute of
Ecology (NIOO-KNAW)
Galaxy Cloudman NIOO Thanks! Questions
Why choose the Cloud?
It’s flexible
You have full control
Perfect for small labs
It’s fancy (Google and Amazon are using it)
It’s environmentally friendly (Gmail: “It’s cooler in the cloud”)
How do we use the Cloud?
Galaxy
a web-based genome analysis platform [1]
A free (for everyone) web service integrating a wealth of
tools, compute resources, terabytes of reference data and
permanent storage
Open source software that makes integrating your own tools
and data and customizing for your own site simple
[1] Slide by Anton Nekrutenko, Galaxy Developer Conference 2011, Lunteren (NL)
Most biologists don’t write code
Analyze
Interactively manipulate genomic data with a comprehensive
and expanding ’best-practices’ toolset
Publish and Share
Results and step-by-step analysis record (Data Libraries and
Histories)
Customizable pipelines (Workflows)
Share workflows with other users
Cloudman
What is Cloudman?
Cloudman is written by Enis Afghan et al., Emory University,
and provides a ready-to-run, dynamically scalable version of
Galaxy on Amazon AWS
It can now also run on the SARA HPC Cloud /
OpenNebula (with some limitations)
How does it work?
A master node contains all the data and tools
Worker nodes are started based on need/load
Data is available on all nodes through a shared filesystem: NFS
RabbitMQ is used for communication between cluster nodes
Jobs are queued using SGE
Galaxy is served by the nginx web server
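The load-based scaling in the list above can be illustrated with a minimal sketch. This is not Cloudman's own code: the `ClusterState` class, the `slots_per_worker` default, and the `max_workers` cap are hypothetical names and values chosen only for illustration. It assumes the master compares the number of queued and running jobs with the job slots its workers offer.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Snapshot of the cluster as seen by the master node (hypothetical)."""
    queued_jobs: int           # jobs waiting in the queue (SGE in the talk)
    running_jobs: int          # jobs currently executing
    workers: int               # worker nodes currently online
    slots_per_worker: int = 4  # assumed number of job slots per worker


def workers_needed(state: ClusterState, max_workers: int = 10) -> int:
    """Worker count that fits all queued and running jobs, capped at max_workers."""
    total_jobs = state.queued_jobs + state.running_jobs
    wanted = math.ceil(total_jobs / state.slots_per_worker)
    return max(0, min(wanted, max_workers))


def scaling_action(state: ClusterState, max_workers: int = 10) -> str:
    """Describe how the worker pool should change for the given load."""
    target = workers_needed(state, max_workers)
    if target > state.workers:
        return f"add {target - state.workers} worker(s)"
    if target < state.workers:
        return f"remove {state.workers - target} worker(s)"
    return "no change"


if __name__ == "__main__":
    busy = ClusterState(queued_jobs=10, running_jobs=2, workers=1)
    print(scaling_action(busy))
```

A real deployment would also need to wait for new instances to boot and mount the shared filesystem before sending jobs to them, which this sketch leaves out.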
Worker instances are being configured
Galaxy is accessible
How is Galaxy used at the NIOO?
Analyzing high-throughput community sequencing data with
QIIME
Denoising (CPU-intensive)
OTU and representative set picking using uclust, cdhit, mothur,
BLAST, or other tools
Taxonomy assignment with BLAST or the RDP classifier
(CPU-intensive)
Sequence alignment with PyNAST, muscle, infernal, or other
tools (CPU-intensive)
and more!
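To give a feel for what OTU and representative set picking means, here is a toy greedy clustering sketch. It is a simplified illustration, not the algorithm used by uclust, cdhit, mothur, or QIIME: the `identity` and `pick_otus` functions are hypothetical, sequences are assumed to be pre-aligned and of equal length, and the first read in each cluster is taken as its representative.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def pick_otus(seqs: dict[str, str], threshold: float = 0.97) -> dict[str, list[str]]:
    """Greedily assign each read to the first representative it matches.

    Returns a mapping from representative read ID to the IDs of all reads
    in that OTU. A read that matches no representative starts a new OTU.
    """
    otus: dict[str, list[str]] = {}
    for read_id, seq in seqs.items():
        for rep_id in otus:
            if identity(seq, seqs[rep_id]) >= threshold:
                otus[rep_id].append(read_id)
                break
        else:
            # No existing representative is similar enough
            otus[read_id] = [read_id]
    return otus


if __name__ == "__main__":
    reads = {
        "r1": "ACGTACGTAC",
        "r2": "ACGTACGTAA",  # one mismatch against r1
        "r3": "TTTTTTTTTT",
    }
    print(pick_otus(reads, threshold=0.8))
```

Every read is compared against the growing set of representatives, so the work grows quickly with the number of reads. That is one reason steps like this benefit from extra worker nodes.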
Thanks!
Thanks to the Galaxy Cloud Team
Questions?
Extra slides
Limitations of OpenNebula
Creating instances with user data (available in the production
cloud?)
No support for growing a qcow filesystem
It would be great to access the cloud through the OpenNebula (ON) API from outside
Cloned instances do not have a working network
More info at
My notes: https://www.cloud.sara.nl/projects/galaxy/wiki
Galaxy Cloud on Amazon: http://usegalaxy.org/cloud
Cloudman scripts: https://bitbucket.org/galaxy/cloudman/
Install tools: https://bitbucket.org/afgane/mi-deployment
Bio-linux repository: http://nebc.nerc.ac.uk/tools/bio-linux/bio-linux-6.0
Launch Cloudman Console
Master node is online
Add extra worker nodes
New instances are pending
New instances are pending #2
New instances are running
New instances are online
Galaxy is accessible