
Transforming Science Through Data-driven Discovery
Scaling Compute with R in CyVerse
Blake Joyce – Science Analyst
University of Arizona
Executive Team
Parker Antin
Nirav Merchant
Eric Lyons
Matt Vaughn
Doreen Ware
Dave Micklos
CyVerse is supported by the National Science Foundation under Grant Nos. DBI-0735191 and DBI-1265383.
Evolution of CyVerse
From plant science, to life science, and beyond…
iPlant 2008: Empowering a New Plant Biology
iPlant 2013: Cyberinfrastructure for Life Science
CyVerse 2016: Transforming Science Through Data-Driven Discovery
Who We Serve
Space Object Behavioral Sciences
1000s of researchers
But what is Cyberinfrastructure?
software
Platforms, tools, datasets
hardware
Storage and compute
people
Expertise, support, training
CyVerse is People + Cyberinfrastructure, empowering researchers
http://www.cyverse.org
Who We Are: the SI Team
Ramona Walls
•4 years
•14 proposals
•5 grants
•14 publications
•Ecology
•Ontologies
•Data Standards
•Data Identifiers
•Data Commons
Upendra Devisetty
•9 months
•3 publications
•Genomics
•Metagenomics
•Transcriptomics
•Docker Genius
Martha Narro
•8 years
•3 proposals
•3 publications
•Image Management and Analysis
•Project Management
Tyson Swetnam
•Aug 2016
•1 proposal
•4 grants (continuing)
•2 publications
•GIS/drones
•Remote sensing
Past Science Informaticians
Blake Joyce
•Aug 2016
•1 proposal
•3 publications
•Agriculture
•Ecology
•Software Carpentry
•Hacky Hour
I’ve Only Recently Learned Computation
• Background: Ecology and plant secondary metabolism engineering
• 2 years ago (Nov 2014) I could not code
• Took a Software Carpentry R course (Feb 2015)
• Co-taught SWC course (Oct 2015)
• Published first (Python) bioinformatic tool called FractBias (Aug 2016)
The Data Cycle
CyVerse Cyberinfrastructure
An Interoperable Ecosystem
iPlant Data Store
iPlant Computational Resources
(Particular focus for this talk)
It All Starts with the Data
• Data Store
• Specific request: Initial allocation of 100 GB
• Allocation can be increased (http://www.cyverse.org/content/increase-your-data-store-allocation)
• We need to report to NSF, so the allocation request form has to be filled out completely!
• Data Commons
• Specific interest: “sequence data. data archiving/backup.”
• Issue DOIs to data sets
• Move data out to NCBI SRA and WGS through a form
• Projects are being created currently
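Before uploading, it can help to check whether a local project folder fits inside the initial 100 GB Data Store allocation. The following is a minimal sketch, not a CyVerse tool: the function names are hypothetical, and it assumes the allocation is counted in decimal gigabytes (10^9 bytes), which the slides do not state.

```python
from pathlib import Path

# Assumption: the 100 GB initial allocation is measured in decimal gigabytes.
INITIAL_ALLOCATION_BYTES = 100 * 10**9


def directory_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def fits_allocation(root, allocation_bytes=INITIAL_ALLOCATION_BYTES):
    """Return True if everything under root fits in the given allocation."""
    return directory_size_bytes(root) <= allocation_bytes
```

Calling fits_allocation("my_project") on a local folder (a placeholder name) returns False when the folder is larger than the allocation, which is the point at which an increase request would be needed.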
Discovery Environment Overview
Hands-on demo: Create a multiple alignment
1. Find a file in the Community Data folder
2. Download a small file of unaligned DNA sequences
3. Upload a small file
4. Use the MUSCLE App to align the sequences
5. Monitor the job status and export its parameters
6. View results
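The demo above runs entirely in the Discovery Environment interface. As an optional side check, the sketch below inspects a FASTA file before and after alignment: unaligned sequences usually differ in length, while an alignment has rows of equal length. The function names are hypothetical and this is not part of the MUSCLE App.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect the pieces.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def is_aligned(records):
    """Return True if all sequences have the same length."""
    return len({len(seq) for seq in records.values()}) <= 1
```

For a downloaded file, read_fasta(open(path).read()) gives the records; is_aligned should typically be False for the input file and True for the MUSCLE output.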
Atmosphere Overview
Productivity
Reproducibility
Get it Done
Advantages:
• Work in an on-demand Linux environment (most bioinformatics)
• Collaborate with students and colleagues on the same instance
• Make data, workflows, and analyses available in a public image
• Access previous software version and images
• Multicore high memory images to run multithreading applications
• Move your analyses from your laptop to the cloud
Integrate Apps and Make Workflows
• We have switched to container technology
• Packages all the dependencies into a container
• You build your own GUI for people to use
• Docker for the DE
• Docker files available on CyVerse GitHub repo
Access RStudio Server (for free!)
• Atmosphere RStudio image
• Use for development of R code
• Go to a browser and paste your instance's IP address
• Add “:8787” -> RStudio Server listens on port 8787
• Get the compute right, and when you’re happy, move to Jetstream
• Jetstream
• Scale up the RStudio compute you perfected on Atmosphere
• Share your code, image, or the Docker container you developed with anyone
• Allow reviewers to access all your compute/code/data to rebuild it themselves
• (Use R Markdown or Jupyter notebooks for style points)
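The "IP address plus :8787" step can be sketched as a small helper. This is only an illustration: the function name is hypothetical, the http:// scheme is an assumption (the slides only say to paste the IP and add the port), and 192.0.2.10 is a documentation-only placeholder address.

```python
import ipaddress

RSTUDIO_PORT = 8787  # port named in the slides


def rstudio_url(ip):
    """Build the browser URL for RStudio Server on an instance.

    Raises ValueError if ip is not a valid IPv4 or IPv6 address.
    Assumption: the server is reached over plain http.
    """
    address = ipaddress.ip_address(ip)
    # IPv6 literals must be wrapped in brackets inside a URL.
    host = f"[{address}]" if address.version == 6 else str(address)
    return f"http://{host}:{RSTUDIO_PORT}"


print(rstudio_url("192.0.2.10"))  # http://192.0.2.10:8787
```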
Specific Needs Mentioned
• “I would like to run specific DNA sequence assembly softwares on a Linux supercomputer through which I could have remote access.”
• HPC and Cloud are 90% Linux computers
• This can be done on cloud computing, the DE, or the HPC through the DE
Specific Needs Mentioned: Jupyter
• “Parallel processing in r”
• “Coding R in parallel, making R use less resources”
• To be frank: it’s not easy in R, but it’s possible
• Purists are going to get angry and mention lapply(), etc.
• Python offers just-in-time (JIT) compiling (the R {jit} package doesn’t work in standard R, only in Ra)
• Apache Spark works in Jupyter really smoothly (SDSC)
• Here’s my non-answer to the question: Jupyter == Julia + Python + R
• Execute all different kinds of code in the same place/interface
• I know this is going to make me some enemies, but my job is to bring you the tools, so... researchers have to move away from using a single language
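In the spirit of using more than one language from the same notebook, here is a minimal Python sketch of a parallel map, the same apply-a-function-to-every-item pattern that lapply() expresses in R. It uses only the standard library; gc_content and parallel_map are illustrative names, not part of any CyVerse tool, and the worker count of 2 is an arbitrary default.

```python
from concurrent.futures import ProcessPoolExecutor


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def parallel_map(func, items, workers=2, executor_cls=ProcessPoolExecutor):
    """Apply func to every item using a pool of workers.

    Results come back in the same order as items. The default pool uses
    separate processes, which suits CPU-bound work; func must then be a
    top-level function so it can be sent to the workers.
    """
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(func, items))
```

Calling parallel_map(gc_content, ["ACGT", "GGCC", "ATAT"]) returns [0.5, 1.0, 0.0]. In a standalone script, make that call under an if __name__ == "__main__": guard so worker processes can start cleanly on every platform. For work that mostly waits on files or the network, passing executor_cls=ThreadPoolExecutor (from the same module) is usually enough.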
Specific Needs Mentioned: Training
• “Using R and Github in the university computing resources”
• Software Carpentry and Data Carpentry
• A great way to get introduced to basic coding
• Can learn how to start with data -> analysis -> advanced graphing
• Advanced graphing example: FractBias
• Research Bazaar Arizona
• Weekly events designed to complement SWC/DC
• For people that want hands-on peer-to-peer training
• Get help with specific problems, help others with specific problems
• Build a community at UA. Learn what other departments are doing.
• Drink tea or beer! (or don’t, it’s whatever)
A Shameless Plug: ResBazAZ
Specific Needs Mentioned: need more info
• “I hope to learn how to run r scripts on the super computer.”
• Answer: cloud computing A.K.A. “own your own ghost computer”
• “be interested in MATLAB links to the HPC. I am also wondering if there is a Github interface with Cyverse.”
• “I want to know the amount of storage we are allowed to use, the memory we can use, how many cores we can use, how to submit jobs for parallel computing.”
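For the cores, memory, and storage question, a quick first step on any cloud instance is to ask the machine what it has. The sketch below uses only the Python standard library; the function names are hypothetical, and the numbers describe the machine itself, not what an HPC scheduler or allocation policy will grant.

```python
import os
import shutil


def total_memory_gb():
    """Return physical memory in GB, or None where it cannot be queried.

    os.sysconf is only available on Unix-like systems such as Linux.
    """
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 10**9
    except (AttributeError, ValueError, OSError):
        return None


def describe_resources(path="."):
    """Summarize CPU cores, memory, and disk space for the volume holding path."""
    usage = shutil.disk_usage(path)
    return {
        "cores": os.cpu_count(),
        "memory_gb": total_memory_gb(),
        "disk_total_gb": usage.total / 10**9,
        "disk_free_gb": usage.free / 10**9,
    }


if __name__ == "__main__":
    for name, value in describe_resources().items():
        print(f"{name}: {value}")
```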
Coming Features of Interest
• Bring your own compute to the DE
• Get an HPC allocation, Jetstream allocation, etc
• Use pre-existing apps in the DE and point it at your own compute
• Contact me, Susan Miller, or Nirav Merchant if interested
• Singularity on HPC
• Docker for HPC (though they hate that description)
• Virtualized environment that lets you install what you please and run it
• Doesn’t give root permissions so it’s secure
• R Shiny server
• Integration with Jupyter (and RStudio Enterprise?)
• GIS