Operational lessons from running OpenStack and Ceph for cancer research at scale
George Mihaiescu, Senior Cloud Architect
Jared Baker, Cloud Specialist
ONTARIO INSTITUTE FOR CANCER RESEARCH
OICR
• Largest cancer research institute in Canada, funded by the government of Ontario
• Together with its collaborators and partners, supports more than 1,700 researchers, clinician scientists, research staff and trainees
• OICR hosts the ICGC's secretariat and its data coordination centre
ICGC - International Cancer
Genome Consortium
Cancer Genome Collaboratory
Project goals and motivation
• Cloud computing environment built for biomedical research by OICR, funded by Government of Canada grants
• Enables large-scale cancer research on the world’s largest cancer genome dataset, currently produced by the International Cancer Genome Consortium (ICGC)
• Built entirely with open-source software such as OpenStack and Ceph
• Compute infrastructure goal: provide 3,000 cores and 15 PB of storage
• A system for cost recovery
Genomics
Genomics workloads
• Users first download large files (150 - 300 GB), then run workflows that analyze the data for days, or even weeks
• Resulting data can be as large as the input data (alignment), or much smaller (mutation calling, 5-10 GB)
• Workloads should be independent, so that one VM failure doesn’t affect multiple analyses
• Newly designed workflows and algorithms are packaged as Docker containers for portability
Capacity vs. performance
Wisely pick your battles
No frills design
• Use high-density commodity servers to reduce physical footprint & related overhead
• Use open source software and tools
• Prefer copper over fiber for network connectivity
• Spend 100% of the hardware budget on the infrastructure that supports cancer research, not on licenses or “nice to have” features
Other design constraints
• Limited datacenter space (12 racks)
• Fixed hardware budget with high data storage requirements
• There are no local backups for the large data sets; reimporting the data is possible but not desirable (more than 500 TB takes time to reimport over the Internet)
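To put the reimport cost in perspective, here is a rough back-of-the-envelope sketch. Only the 500 TB figure comes from the slide; the link speeds and the 70% efficiency factor are illustrative assumptions, and `transfer_days` is a hypothetical helper name.

```python
def transfer_days(size_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Days needed to move size_tb terabytes over a link_gbps link.

    efficiency is an ASSUMED fraction of line rate actually achieved.
    """
    size_bits = size_tb * 1e12 * 8            # decimal terabytes -> bits
    rate_bps = link_gbps * 1e9 * efficiency   # usable bits per second
    return size_bits / rate_bps / 86_400      # seconds -> days


if __name__ == "__main__":
    # Assumed link speeds, for illustration only.
    for gbps in (1, 10):
        print(f"{gbps} Gbps: {transfer_days(500, gbps):.1f} days")
```

Under these assumptions a full reimport takes roughly a week on a 10 Gbps link and about two months on a 1 Gbps link, which is why losing the data set is expensive even though it can be recovered.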
Hardware architecture
Compute nodes
Hardware architecture
Ceph storage nodes
Control plane
• Three controllers in HA configuration (2 x 6-core CPUs, 128 GB RAM, 6 x 200 GB Intel S3700 SSD drives)
• Operating system and Ceph Mon on the first RAID 1 container
• MariaDB/Galera on the second RAID 1 container
• Ceilometer with MongoDB on the third RAID 1 container
• HAProxy (SSL termination) and Keepalived
• 4 x 10 GbE bonded interfaces, 802.3ad, layer 3+4 hash
• Neutron + GRE, HA routers, no DVR
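The practical effect of a layer 3+4 hash on an 802.3ad bond can be sketched as follows. This is a simplified model loosely based on the description in the Linux bonding documentation, not the exact kernel or switch algorithm; `pick_link` and the sample addresses and ports are hypothetical.

```python
import ipaddress


def pick_link(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              n_links: int = 4) -> int:
    """Return the index of the bonded link a flow would use (simplified model)."""
    h = src_port ^ dst_port
    h ^= int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    h ^= h >> 16   # fold the upper bits into the lower ones
    h ^= h >> 8
    return h % n_links


if __name__ == "__main__":
    # One flow (fixed addresses and ports) always maps to the same link.
    print(pick_link("10.0.0.1", "10.0.0.2", 40000, 6800))
    # Many flows between the same two hosts spread across the member links.
    links = {pick_link("10.0.0.1", "10.0.0.2", port, 6800)
             for port in range(40000, 40064)}
    print(sorted(links))
```

The takeaway is that a single TCP connection is limited to one 10 GbE member link, while workloads that open many connections can use the aggregate bandwidth of the bond.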
Networking
• Brocade ICX 7750-48C top-of-rack switches configured in a stack ring topology
• 6 x 40 Gb Twinax cables between the racks, providing 240 Gbps non-blocking redundant connectivity (2:1 oversubscription ratio)
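The 240 Gbps and 2:1 figures can be reproduced with a short calculation. The 6 x 40 Gb inter-rack links come from the slide; the 48 x 10 GbE server-facing ports per switch are an assumption inferred from the switch model name, and `oversubscription` is a hypothetical helper.

```python
def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Ratio of server-facing capacity to inter-rack capacity."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)


if __name__ == "__main__":
    uplink_gbps = 6 * 40                        # from the slide: 240 Gbps
    ratio = oversubscription(48, 10, 6, 40)     # 48 x 10 GbE is ASSUMED
    print(f"uplink capacity: {uplink_gbps} Gbps, oversubscription {ratio:.0f}:1")
```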
Software – entirely open source
Custom object storage client developed at OICR
• A client-server application for both uploading and downloading data using temporary pre-signed URLs from multiple object storage systems
• Core features
  • Support for encrypted and authorized transfers
  • High-throughput: multi-part parallel upload/download
  • Resumable downloads/uploads
• Download-specific features
  • Support for BAM slicing
  • Support for Filesystem in Userspace (FUSE)
https://github.com/icgc-dcc/dcc-storage
https://hub.docker.com/r/icgc/icgc-storage-client/
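The idea behind multi-part parallel and resumable transfers can be illustrated with a minimal sketch. This is not the OICR client's code; the function names and part sizes are hypothetical, and only the HTTP `Range` header format is standard.

```python
from typing import List, Set, Tuple


def part_ranges(object_size: int, part_size: int) -> List[Tuple[int, int]]:
    """Split an object into inclusive (start, end) byte ranges."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size > 0")
    return [(start, min(start + part_size, object_size) - 1)
            for start in range(0, object_size, part_size)]


def remaining_parts(ranges: List[Tuple[int, int]],
                    completed: Set[int]) -> List[Tuple[int, int]]:
    """Parts still to fetch when resuming an interrupted transfer."""
    return [r for i, r in enumerate(ranges) if i not in completed]


def range_header(start: int, end: int) -> str:
    """Value of the HTTP Range header for one part."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    ranges = part_ranges(10, 4)
    print(ranges)
    print([range_header(s, e) for s, e in remaining_parts(ranges, {0, 2})])
```

Each range can be requested independently against a pre-signed URL, so parts can be fetched in parallel and only the missing ones need to be retried after a failure.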
Cloud usage
• 57,000 instances started in the last two years
• 6,800 in the last three months
• 50 users in 16 research labs across three continents
• More than 500 TB (1.5 PB) stored in Ceph
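One plausible reading of "500 TB (1.5 PB)" is 500 TB of user data occupying 1.5 PB of raw capacity. The sketch below shows that arithmetic under the assumption of three-way replication, which the slide does not state; `raw_usage_tb` is a hypothetical helper.

```python
def raw_usage_tb(stored_tb: float, replicas: int = 3) -> float:
    """Raw capacity consumed when each object is stored `replicas` times."""
    return stored_tb * replicas


if __name__ == "__main__":
    # replicas=3 is an ASSUMPTION, not a figure from the talk.
    print(f"{raw_usage_tb(500) / 1000} PB raw")
```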
In-house developed usage reporting app
OpenStack Upgrades
Ceph Upgrades
Security Updates
ELK Ops dashboard
Deployments
• Evolving each deployment
• Open to improvements
• Avoid being tedious
Operations details
• On-site spares and technicians
• Let Ceph heal itself
• Monitor everything
• Can you script that?
ARA - Ansible Run Analysis
VLAN-based networking
Ceph Monitoring
• IOPS
• Performance & Integrity
• Radosgw throughput
• Rebalancing: network traffic, CPU, memory, IOPS, disk
Rally
• Smoke tests & load tests
• Grafana integration
Capacity usage
Lessons learned
• If something needs to be running, test it
• Simple tasks sometimes are not
• Be generous with your specs for the monitoring and control plane (more RAM and CPU than you think you will need)
• More RAM and CPU on the Ceph storage nodes let you run larger nodes without being affected by small memory leaks
• Monitor RAM usage aggregated per process type
• It’s possible to run a stable and performant OpenStack cluster with a small but qualified team, as long as you design it carefully and choose only the most stable (and absolutely needed) OpenStack projects and configurations.
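Aggregating RAM usage per process type can be sketched as below. This is a minimal Linux-only illustration that reads `/proc/<pid>/status`; it is not the monitoring setup used in the talk, and the function names are hypothetical.

```python
import os
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def aggregate_rss(samples: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum resident memory (kB) over all processes sharing a name."""
    totals: Dict[str, int] = defaultdict(int)
    for name, rss_kb in samples:
        totals[name] += rss_kb
    return dict(totals)


def read_proc_samples(proc_root: str = "/proc") -> Iterator[Tuple[str, int]]:
    """Yield (process name, RSS in kB) for each process listed under /proc."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                fields = dict(line.split(":", 1) for line in fh if ":" in line)
        except OSError:
            continue  # the process exited while we were reading
        if "VmRSS" in fields:  # kernel threads report no VmRSS
            yield fields["Name"].strip(), int(fields["VmRSS"].split()[0])


if __name__ == "__main__" and os.path.isdir("/proc"):
    totals = aggregate_rss(read_proc_samples())
    for name, kb in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
        print(f"{name:20s} {kb / 1024:10.1f} MiB")
```

Summing per process name makes a slow leak across many daemons of the same type visible as one rising total, even when no single process stands out.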
Future plans
• Upgrade to Ubuntu 16.04 and OpenStack Newton
• Build a new and larger environment with a similar design, but with a leaf-spine network topology
• Investigate the stability of a container-based control plane (Kolla)
Thank you
• Discovery Frontiers: Advancing Big Data Science in Genomics Research program (grant no. RGPGR/448167-2013, ‘The Cancer Genome Collaboratory’)
• Natural Sciences and Engineering Research Council (NSERC) of Canada
• the Canadian Institutes of Health Research (CIHR), Genome Canada
• the Canada Foundation for Innovation (CFI)
• Ontario Research Fund of the Ministry of Research, Innovation and Science.
Funding for the Ontario Institute for Cancer Research
is provided by the Government of Ontario
Contact
George Mihaiescu
[email protected]
Jared Baker
[email protected]
www.cancercollaboratory.org
Questions?