Operational lessons from running OpenStack and Ceph for cancer research at scale

George Mihaiescu, Senior Cloud Architect
Jared Baker, Cloud Specialist
ONTARIO INSTITUTE FOR CANCER RESEARCH

OICR
• Largest cancer research institute in Canada, funded by the government of Ontario
• Together with its collaborators and partners, supports more than 1,700 researchers, clinician scientists, research staff and trainees
• OICR hosts the ICGC's secretariat and its data coordination centre

ICGC - International Cancer Genome Consortium

Cancer Genome Collaboratory
Project goals and motivation
• Cloud computing environment built for biomedical research by OICR, and funded by government of Canada grants
• Enables large-scale cancer research on the world's largest cancer genome dataset, currently produced by the International Cancer Genome Consortium (ICGC)
• Built entirely with open-source software such as OpenStack and Ceph
• Infrastructure goal: provide 3,000 cores and 15 PB of storage
• A system for cost recovery

Genomics

Genomics workloads
• Users first download large files (150 - 300 GB), then run workflows that analyze the data for days, or even weeks
• Resulting data can be as large as the input data (alignment), or much smaller (mutation calling, 5 - 10 GB)
• Workloads should be independent, so that one VM failure doesn't affect multiple analyses
• Newly designed workflows and algorithms are packaged as Docker containers for portability

Capacity vs. performance

Wisely pick your battles

No frills design
• Use high-density commodity servers to reduce physical footprint and related overhead
• Use open-source software and tools
• Prefer copper over fiber for network connectivity
• Spend 100% of the hardware budget on the infrastructure that supports cancer research, not on licenses or "nice to have" features

Other design constraints
• Limited datacenter space (12 racks)
• Fixed hardware budget with high data storage requirements
• There are no local backups for the large data sets, and re-importing the data, though possible, is not desirable (500+ TB takes time to re-import over the Internet)

Hardware architecture: Compute nodes

Hardware architecture: Ceph storage nodes

Control plane
• Three controllers in HA configuration (2 x 6-core CPUs, 128 GB RAM, 6 x 200 GB Intel S3700 SSD drives)
• Operating system and Ceph Mon on the first RAID 1 container
• MariaDB/Galera on the second RAID 1 container
• Ceilometer with MongoDB on the third RAID 1 container
• HAProxy (SSL termination) and Keepalived
• 4 x 10 GbE bonded interfaces, 802.3ad, layer 3+4 hash
• Neutron + GRE, HA routers, no DVR

Networking
• Brocade ICX 7750-48C top-of-rack switches configured in a stack ring topology
• 6 x 40 Gb Twinax cables between the racks, providing 240 Gbps nonblocking redundant connectivity (2:1 oversubscription ratio)

Software - entirely open source

Custom object storage client developed at OICR
• A client-server application for both uploading and downloading data using temporary pre-signed URLs from multiple object storage systems
• Core features
  • Support for encrypted and authorized transfers
  • High throughput: multi-part parallel upload/download
  • Resumable downloads/uploads
• Download-specific features
  • Support for BAM slicing
  • Support for Filesystem in Userspace (FUSE)
https://github.com/icgc-dcc/dcc-storage
https://hub.docker.com/r/icgc/icgc-storage-client/

Cloud usage
• 57,000 instances started in the last two years
• 6,800 in the last three months
• 50 users in 16 research labs across three continents
• More than 500 TB (1.5 PB) stored in Ceph

In-house developed usage reporting app

OpenStack upgrades

Ceph upgrades

Security updates

ELK Ops dashboard

Deployments
• Evolving each deployment
• Open to improvements
• Avoid being tedious

Operations details
• On-site spares and technicians
• Let Ceph heal itself
• Monitor everything
• Can you script that?
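
As one illustration of "Monitor everything" and "Can you script that?", the sketch below turns the one-line status printed by the `ceph health` command into an exit code that a monitoring system could act on. It is not the check OICR runs. The Nagios-style exit codes and the function names `classify_ceph_health` and `check_ceph` are assumptions made for this example.

```python
import subprocess

# Nagios-style exit codes (an assumption about the monitoring system in use).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_ceph_health(status_line: str) -> int:
    """Map the first token of `ceph health` output to an exit code."""
    status = status_line.strip()
    if status.startswith("HEALTH_OK"):
        return OK
    if status.startswith("HEALTH_WARN"):
        # Typical while the cluster is rebalancing or recovering on its own.
        return WARNING
    if status.startswith("HEALTH_ERR"):
        return CRITICAL
    return UNKNOWN


def check_ceph() -> int:
    """Run `ceph health` and classify its output (hypothetical helper)."""
    try:
        result = subprocess.run(
            ["ceph", "health"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        # The ceph CLI is missing or the monitors did not answer in time.
        return UNKNOWN
    if result.returncode != 0:
        return UNKNOWN
    return classify_ceph_health(result.stdout)
```

A cron job or monitoring agent could call `sys.exit(check_ceph())`. Treating `HEALTH_WARN` as a warning rather than a page is one way to let the cluster heal itself before a human steps in.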

ARA - Ansible Run Analysis

VLAN-based networking

Ceph Monitoring: IOPS

Ceph Monitoring: Performance & integrity

Ceph Monitoring: Radosgw throughput

Ceph Monitoring: Rebalancing - network traffic

Ceph Monitoring: Rebalancing - CPU

Ceph Monitoring: Rebalancing - memory

Ceph Monitoring: Rebalancing - IOPS

Ceph Monitoring: Rebalancing - disk

Rally: Smoke tests & load tests

Rally: Grafana integration

Capacity usage

Lessons learned
• If something needs to be running, test it
• Simple tasks sometimes are not
• Be generous with your specs for the monitoring and control plane (more RAM and CPU than you think will be needed)
• More RAM and CPU on the Ceph storage nodes allow you to have larger nodes and not be affected by small memory leaks
• Monitor RAM usage aggregated per process type
• It's possible to run a stable and performant OpenStack cluster with few but qualified resources, as long as you design it carefully and choose the most stable (and absolutely needed) OpenStack projects and configurations

Future plans
• Upgrade to Ubuntu 16.04 and OpenStack Newton
• Build a new and larger environment with a similar design, but with leaf-spine networking
• Investigate the stability of a container-based control plane (Kolla)

Thank you
• Discovery Frontiers: Advancing Big Data Science in Genomics Research program (grant no. RGPGR/448167-2013, 'The Cancer Genome Collaboratory')
• Natural Sciences and Engineering Research Council (NSERC) of Canada
• The Canadian Institutes of Health Research (CIHR), Genome Canada
• The Canada Foundation for Innovation (CFI)
• Ontario Research Fund of the Ministry of Research, Innovation and Science

Funding for the Ontario Institute for Cancer Research is provided by the Government of Ontario

Contact
George Mihaiescu [email protected]
Jared Baker [email protected]
www.cancercollaboratory.org

Questions?
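
The lesson "Monitor RAM usage aggregated per process type" can be illustrated with a short sketch. It sums resident memory per command name from text in the two-column form printed by `ps -eo comm=,rss=` (command name, then RSS in KiB). The function name `rss_by_process_name` and the sample figures are made up for this example and are not measurements from the Collaboratory.

```python
from collections import defaultdict


def rss_by_process_name(ps_output: str) -> dict:
    """Sum resident set size (KiB) per command name.

    Expects lines shaped like the output of `ps -eo comm=,rss=`:
    a command name followed by an integer RSS value.
    """
    totals = defaultdict(int)
    for line in ps_output.splitlines():
        # Split from the right so command names containing spaces survive.
        parts = line.rsplit(None, 1)
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        name, rss_kib = parts
        totals[name.strip()] += int(rss_kib)
    return dict(totals)


# Hypothetical sample input, not real measurements.
sample = """\
ceph-osd  1200000
ceph-osd  1350000
nova-api   300000
"""

totals = rss_by_process_name(sample)
```

Graphing these per-name totals over time, rather than per-PID values, makes a slow leak in one daemon type visible even when individual processes restart.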