A PIPELINE FOR DIAGNOSING SICK
KIDS: AN EXAMPLE OF COMPOSABLE
BIOCOMPUTE OBJECTS
Genomic analysis for diagnosing sick
children
• Many different circumstances and different options:
• May be testing just the child
• May be testing parents also
• May not have both parents
• But may have extended family members
• May have evidence of heredity
• Might eliminate searching for de novo variants
• Whole genome? Whole exome? Panel? RNA?
Epigenomic tests?
Sick kids pedigree pipeline(s) –
components in use
• Standard germline variant calling pipeline on each genome
• Same for WGS and WES
• Best practices pre-processing
• Trio/quad analysis when available
• Extended pedigree analysis if needed
• Mitochondrial DNA analysis on WGS
• De novo variant discovery and filtering
• Without filtering, a very high percentage of de novo calls are false
• 70-120 true calls vs. thousands of false positives
• Annotation
• Interpretation and diagnosis – not automated
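The de novo filtering step can be illustrated with a minimal sketch. The `Call` record, the depth and genotype-quality thresholds, and the function name are all hypothetical; the slides do not state which filters the pipeline applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One sample's genotype at a site (hypothetical minimal record)."""
    gt: tuple   # allele indices, e.g. (0, 1); 0 means reference
    depth: int  # read depth at the site
    gq: int     # Phred-scaled genotype quality


def is_candidate_de_novo(child, father, mother, min_depth=20, min_gq=30):
    """Return True if the child carries a non-reference allele that neither
    parent carries, and all three genotypes pass the assumed thresholds."""
    trio = (child, father, mother)
    if any(c.depth < min_depth or c.gq < min_gq for c in trio):
        return False  # low-confidence genotypes are a major source of false calls
    parents_hom_ref = all(allele == 0 for allele in father.gt + mother.gt)
    child_has_alt = any(allele != 0 for allele in child.gt)
    return parents_hom_ref and child_has_alt
```

Requiring confident genotypes in all three family members reflects the point that most unfiltered de novo calls are false: a poorly covered parent can hide an inherited allele.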
Overall Pipeline: Components in Use
Diagram (boxes recovered from the slide):
• Main flow: BAM Generation, Variant Calling, Annotation, Manual Interpretation & Diagnosis
• Additional analysis components: Pedigree Analysis, Mitochondrial Analysis, De Novo Analysis
CURRENT BEST PRACTICES
Variant Discovery on each genome
Diagram (boxes recovered from the slide): preprocessing produces an analysis-ready BAM, which feeds three discovery tracks.
• SNV/indel discovery: GATK HaplotypeCaller, GATK Joint Genotyping, GATK Variant Quality Score Recalibration (VQSR); output: analysis-ready SNV/indel calls
• SV/CNV discovery: GenomeSTRiP joint calling, GenomeSTRiP genotyping, SV/CNV filtering; output: analysis-ready SV/CNV calls
• Mitochondrial DNA analysis: variant discovery, heteroplasmy estimation, variant quality filtering; output: analysis-ready mtDNA variant calls
Pedigree Analysis Workflow
Diagram (boxes recovered from the slide):
• Inputs: analysis-ready SNV/indel calls, analysis-ready SV/CNV calls, analysis-ready mtDNA variant calls, clinical information
• Steps: pedigree analysis, extended pedigree analysis, de novo filtering, functional annotation
• Output: results / interpretation
Pedigree Analysis - Quad
Diagram (boxes recovered from the slide):
• Joint genotyping of the proband, father, mother, and unaffected sibling gVCFs into a multi-sample VCF
• Variant Quality Score Recalibration
• Calculate genotype posteriors
• Phase genotypes
• MALEFACTOR (NYGC software), with inputs: family members' affection status, penetrance model, mode of inheritance
• De novo identification
• De novo validation + filtration
• Variant filtration
• Functional annotation + interpretation
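The quad analysis takes family members' affection status, a penetrance model, and a mode of inheritance as inputs. A minimal sketch of such a segregation check follows, assuming full penetrance and only two autosomal modes. It is an illustration of the idea, not MALEFACTOR's algorithm, which the slides do not describe; all names are hypothetical.

```python
def alt_count(gt):
    """Number of non-reference alleles in a diploid genotype such as (0, 1)."""
    return sum(1 for allele in gt if allele != 0)


def consistent_with_mode(genotypes, affected, mode):
    """Check one variant against an inheritance mode, assuming full penetrance.

    genotypes: dict of family member -> genotype tuple
    affected:  dict of family member -> bool (affection status)
    mode:      "recessive" or "dominant"
    """
    if mode not in ("recessive", "dominant"):
        raise ValueError(f"unsupported mode: {mode}")
    for member, gt in genotypes.items():
        n_alt = alt_count(gt)
        # Under full penetrance, disease genotype and affection status must match.
        has_disease_genotype = n_alt == 2 if mode == "recessive" else n_alt >= 1
        if has_disease_genotype != affected[member]:
            return False
    return True


# Hypothetical quad: affected proband, carrier parents, unaffected sibling.
quad = {"proband": (1, 1), "father": (0, 1), "mother": (0, 1), "sibling": (0, 0)}
status = {"proband": True, "father": False, "mother": False, "sibling": False}
fits_recessive = consistent_with_mode(quad, status, "recessive")  # True
fits_dominant = consistent_with_mode(quad, status, "dominant")    # False
```

A reduced-penetrance model would relax the strict match for unaffected carriers, which is one reason the penetrance model is a separate input.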
Mitochondrial DNA Analysis
Diagram (boxes recovered from the slide):
• Two mitochondrial references: rCRS (drawn as 1 to 16569) and rCRS-alt (drawn with the 16569/1 junction)
• Alignment to GRCh37 with rCRS (BWA aligner): heteroplasmy discovery, homoplasmy discovery, mtDNA copy number relative to nDNA, unified VCF
• Alignment to GRCh37 with rCRS-alt (BWA aligner): heteroplasmy discovery, homoplasmy discovery, mtDNA copy number relative to nDNA, unified VCF
• Reconstructed VCF derived from rCRS and rCRS-alt
• Variant annotation: MitoMap, ClinVar, OMIM
• Variant effect scoring for prioritization
• Tumor/normal comparison
• Variant interpretation
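Two quantities named in the mitochondrial workflow, heteroplasmy and mtDNA copy number relative to nDNA, can be sketched from read counts and coverage. The thresholds below are hypothetical, and the coverage-ratio formula is a commonly used estimate; the slides do not say which method the pipeline uses.

```python
def heteroplasmy_fraction(alt_reads, total_reads):
    """Fraction of reads supporting the alternate allele at an mtDNA site."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def classify_mt_variant(alt_reads, total_reads, min_het=0.03, min_hom=0.97):
    """Label a site as homoplasmic, heteroplasmic, or below threshold.

    min_het and min_hom are assumed cutoffs chosen for illustration only.
    """
    fraction = heteroplasmy_fraction(alt_reads, total_reads)
    if fraction >= min_hom:
        return "homoplasmic"
    if fraction >= min_het:
        return "heteroplasmic"
    return "below_threshold"


def mtdna_copy_number(mt_mean_coverage, autosomal_mean_coverage):
    """Estimate mtDNA copies per cell as 2 * mt coverage / autosomal coverage.

    The factor of 2 accounts for the diploid nuclear genome.
    """
    if autosomal_mean_coverage <= 0:
        raise ValueError("autosomal_mean_coverage must be positive")
    return 2.0 * mt_mean_coverage / autosomal_mean_coverage
```

For example, 50 alternate reads out of 1000 gives a fraction of 0.05, which this sketch labels heteroplasmic; mean coverage of 6000 on mtDNA and 30 on the autosomes gives an estimate of 400 copies per cell.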
Ancestry analysis: apply population-specific allele frequencies
Name: Steven Walerstein
Project: Project_CLIN_11377_B01_SOM_WGS
Sample ID: ONC15-50N-D
Analysis: Whole genome sequencing data generated as part of participation in the General Population Research Study was used for the ancestry analysis. 1000 Genomes Project data was used as the reference for population stratification.
• Central Asian: 2.1%
• Italian / Balkan: 12.1%
• Middle Eastern: 0.7%
• European Jewish or East Mediterranean: 85.1%
Note: These proportions show the approximate locations of your ancestors ~500 years ago, as determined by a comparison of your DNA to that of a set of reference
Composability
• These various individual compute components can be used
in many contexts.
• Variant calling is the same for WGS and WES
• Structural variant calling is more accurate on whole genome
• Mitochondrial DNA analysis is often neglected in standard WGS
analysis workflows. Can be performed on any whole genome
without the need for a dedicated Mitochondrial panel.
• Research or trials on cohorts may need different combinations than individual clinical samples, e.g. joint genotyping.
• Pedigree vs Extended Pedigree is a choice per proband
• How is the right pipeline selected?
• Should these be combined?
• Ancestry analysis helps evaluate population-specific variant allele frequencies
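The question of how the right pipeline is selected per proband can be made concrete with a small sketch. The rules below paraphrase points from these slides (mitochondrial analysis on WGS, pedigree analysis when parents are available, skipping the de novo search when there is evidence of heredity), but the function, its parameters, and the component names are hypothetical. Restricting SV/CNV calling to WGS is a simplification of "more accurate on whole genome".

```python
def select_components(assay, has_both_parents, has_extended_family, suspect_inherited):
    """Return an ordered list of component names for one proband.

    assay: "WGS" or "WES"; the other arguments are booleans.
    The rules are illustrative only, not a clinical decision procedure.
    """
    components = ["preprocessing", "snv_indel_calling"]  # same for WGS and WES
    if assay == "WGS":
        components += ["sv_cnv_calling", "mitochondrial_analysis"]
    if has_both_parents:
        components.append("pedigree_analysis")
        if not suspect_inherited:
            components.append("de_novo_analysis")
    elif has_extended_family:
        components.append("extended_pedigree_analysis")
    components += ["annotation", "manual_interpretation"]
    return components


# Hypothetical case: exome trio with no evidence of heredity.
wes_trio = select_components("WES", True, False, False)
```

Expressing the selection as data (a list of component names) rather than as a fixed script is one way each component could be recorded and reused as a separate object.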
Versioning
• These are clinical pipelines
• We change our research pipelines frequently
• We periodically validate the clinical pipelines and submit them to the state for re-certification.
• How will that work with a bio-compute repository?
• Is that validation sufficient?
• Is another required?
• What if the two don’t agree?
• Standard operating procedures for validating any
changes in sequencing or bioinformatics analyses
workflows
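One way to make "agree" measurable when validating a changed workflow is to compare the variant calls produced before and after the change on the same sample. The sketch below is an assumption about how such a comparison might look, not a described procedure; the key layout (chromosome, position, reference allele, alternate allele) and the function name are hypothetical.

```python
def concordance(calls_a, calls_b):
    """Compare two call sets keyed by (chrom, pos, ref, alt) tuples."""
    a, b = set(calls_a), set(calls_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Jaccard index: 1.0 means identical call sets (two empty sets count as identical).
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical call sets from two pipeline versions on the same sample.
old_calls = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")}
new_calls = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("3", 10, "T", "C")}
report = concordance(old_calls, new_calls)
```

A validation SOP would still need to define what level of disagreement is acceptable and how discordant calls are reviewed.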
Questions
• How do we handle version management in a bio-compute
repository?
• Different projects need different versions
• Different projects need different references
• Where are standard data like references stored?
• What tools will we provide for composability?
• Do we need to?
• How do we validate pipelines in a repository?
• What could change to invalidate a pipeline?
• Is there a risk that composing many bio-compute objects
from the repository could make re-identification easier?
• Do we deal with that in any technical way?
• Reproducibility:
• Data isn’t always there
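The question of what could change to invalidate a pipeline suggests recording everything a validation depended on and detecting when any of it changes. A minimal sketch follows, assuming a pipeline can be described by a flat manifest of tool versions and reference checksums. The manifest fields and values are placeholders, not real versions, and this is not a BioCompute Object schema.

```python
import hashlib
import json


def pipeline_fingerprint(manifest):
    """SHA-256 over a canonical JSON encoding of the manifest.

    Sorting keys makes the fingerprint independent of field order.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_fields(validated, current):
    """Names of manifest fields that differ between two manifests."""
    keys = set(validated) | set(current)
    return sorted(k for k in keys if validated.get(k) != current.get(k))


# Placeholder manifests; every value here is hypothetical.
validated = {"aligner": "bwa", "aligner_version": "X.Y.Z", "reference": "GRCh37"}
current = {"reference": "GRCh38", "aligner": "bwa", "aligner_version": "X.Y.Z"}

needs_revalidation = pipeline_fingerprint(validated) != pipeline_fingerprint(current)
differences = changed_fields(validated, current)
```

A fingerprint mismatch only signals that something changed; deciding whether the change requires re-certification remains a policy question.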
Acknowledgements
• Avinash Abhyankar
• Belinda Cornes
• Anne-Katrin Emde
• Giuseppe Narzise
• Bo-Juen Chen
• Jimmy Lin
• Christian Stolte
• Shailu Gargeya
• Clint Howarth
• Manisha Kher
• Terry Dontje
• Uday Evani