Interpreting exomes and genomes: a beginner’s guide

Daniel MacArthur
Analytic and Translational Genetics Unit
Massachusetts General Hospital
Broad Institute of Harvard and MIT
www.macarthurlab.org
Twitter: @dgmacarthur
Overview
• Fundamentals of next-generation
sequencing
• Genomes, exomes and targeted panels
• Genomic diagnosis: how do we filter
causal variants from a patient’s entire
genome?
• Major challenges for NGS diagnosis
Next-generation sequencing
• Many different technologies
– Illumina
– Pacific Biosciences
– Oxford Nanopore
• Can chop up DNA and read many of the fragments
at the same time: massively parallel sequencing
Sequencing yields billions of reads per run
CGTTACGGCAGACG
TTTGAACTTTCATAG
GGGACATATTCGAAAT
ACGGGATGTACG
TAGACATAGACGACT
GGGATGTACGAA
GTACTGACCAG
GACCAGTAGAC
GACATAGACGACT
CCAGTAGACATA
ACGAGCCGTAGCTA
TTTGACGGGATG
GGGATGTACGA
AGACGACTTTGAC
CGAGCCGTAGCTA
ATAGACGACTTTGA
GGGATGTACGAG
GGGATGTATGAG
TACGAGCCGTA
TGTACGAGCCGTA
Compare the reads to a reference genome
GTACTGACCAGTAGACATAGACGACTTTGACGGGATGTACGAGCCGTAGCTA
ACGGGATGTACG
TAGACATAGACGACT
GGGATGTACGAA
GTACTGACCAG
GACCAGTAGAC
GACATAGACGACT
CCAGTAGACATA
ACGAGCCGTAGCTA
TTTGACGGGATG
GGGATGTACGA
AGACGACTTTGAC
CGAGCCGTAGCTA
ATAGACGACTTTGA
GGGATGTACGAG
GGGATGTATGAG
TACGAGCCGTA
TGTACGAGCCGTA
NGS allows us to sample the sequence position many times over
GTACTGACCAGTAGACATAGACGACTTTGACGGGATGTACGAGCCGTAGCTA
           TAGACATAGACGACT
                             ACGGGATGTATG
                               GGGATGTATGA
GTACTGACCAG
                         TTTGACGGGATG ATGAGCCGTAGCTA
     GACCAGTAGAC
                                    GTACGAGCCGTA
       CCAGTAGACATA
                                       TGAGCCGTAGCTA
             GACATAGACGACT
                               GGGATGTATGAG
                ATAGACGACTTTGA GGGATGTACGAG
                                     TACGAGCCGTA
                  AGACGACTTTGAC
                                   TGTACGAGCCGTA
Challenges:
• Mapping short reads
• Variable coverage
• Base calling quality
• All of these tend to be worse for insertions
and deletions than for SNPs
Example variant call: C -> T (5 reads with C / 5 reads with T)
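The pileup above can be turned into allele counts with a few lines of code. This is a minimal sketch, not a real aligner: it places each read at the reference offset with the fewest mismatches (a brute-force stand-in for a mapping tool) and tallies the bases seen at one position. The function names and the choice of four reads from the slide are illustrative assumptions.

```python
from collections import Counter

REFERENCE = "GTACTGACCAGTAGACATAGACGACTTTGACGGGATGTACGAGCCGTAGCTA"

def best_offset(read, reference):
    """Return the reference offset where the read has the fewest mismatches."""
    def mismatches(offset):
        window = reference[offset:offset + len(read)]
        return sum(a != b for a, b in zip(read, window))
    offsets = range(len(reference) - len(read) + 1)
    return min(offsets, key=mismatches)

def allele_counts(reads, reference, position):
    """Count the bases that placed reads show at a 0-based reference position."""
    counts = Counter()
    for read in reads:
        start = best_offset(read, reference)
        if start <= position < start + len(read):
            counts[read[position - start]] += 1
    return counts

# A handful of reads taken from the slide's pileup.
reads = ["GGGATGTACGAG", "GGGATGTATGAG", "TACGAGCCGTA", "GTACTGACCAG"]
# Position of the C in ...GGGATGTACGAG... (0-based index into the reference).
variant_position = REFERENCE.index("GGGATGTACGAG") + 8
counts = allele_counts(reads, REFERENCE, variant_position)
print(REFERENCE[variant_position], dict(counts))
```

A roughly even split between the reference base and an alternate base, as in the slide's 5 C / 5 T, is what a heterozygous site is expected to look like; a single discordant read among many is more likely a sequencing error.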
Which technology to choose?
Technology                 Percent of genome sequenced
Whole Genome Sequencing    >95%
Whole Exome Sequencing     ~1.5% (protein-coding regions)
Targeted Sequencing        0.005% - 0.1% (100s – 1000s of genes)
[Figure arrows alongside the three technologies: Cost; Depth of Coverage]
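The link between target size and depth of coverage is simple arithmetic: mean depth is the total number of sequenced bases divided by the size of the target. The sketch below assumes a genome of about 3.1 billion bases, a made-up budget of 10 million 150-base reads, and that every read lands on target (real captures are less efficient). The fractions come from the slide above, using the 0.1% upper end for targeted panels.

```python
GENOME_SIZE_BP = 3_100_000_000  # assumed approximate human genome size

# Fraction of the genome sequenced, from the slide (targeted uses the 0.1% upper end).
TARGET_FRACTION = {
    "whole genome": 0.95,
    "whole exome": 0.015,
    "targeted panel": 0.001,
}

def mean_depth(n_reads, read_length_bp, target_size_bp):
    """Average number of reads covering each targeted base, if all reads are on target."""
    return n_reads * read_length_bp / target_size_bp

# Hypothetical fixed budget: the same number of reads spent on each design.
for name, fraction in TARGET_FRACTION.items():
    target_bp = GENOME_SIZE_BP * fraction
    depth = mean_depth(n_reads=10_000_000, read_length_bp=150, target_size_bp=target_bp)
    print(f"{name}: ~{depth:.1f}x")
```

For the same number of reads, a smaller target is covered far more deeply, which is why panels and exomes reach high depth at lower cost than whole genomes.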
Targeted sequencing
The problem with exome data
• Clinically and genetically
heterogeneous conditions
[Figure: table of variant calls, x 30,000 rows]
Sifting signal from noise in exomes
• Every genome contains many rare,
potentially functional variants
– ~500 rare missense variants
– ~100 loss-of-function (LoF) variants: ~20 homozygous, ~20 rare
– ~100 rare variants in known disease genes
– 5-10 recessive disease-causing mutations
– 1-2 de novo coding mutations
– sequencing errors
• In Mendelian disease patients we need to find
1-2 true causal mutations amidst this “noise”
How do we find pathogenic variants?
1. Is the variant a known pathogenic
variant? How much evidence supports
the claim of pathogenicity?
2. Is the variant rare?
3. Is it predicted to have a functional
impact (change a protein sequence)?
4. Does it segregate with disease?
5. Is the gene associated with the disease?
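Questions 2, 3 and 5 above lend themselves to a simple automated filter; questions 1 and 4 need curated evidence and family data. The sketch below is illustrative only: the `Variant` fields, the list of protein-altering consequences, the 0.1% frequency cutoff and the placeholder gene names are all assumptions, not values from the talk.

```python
from dataclasses import dataclass

# Consequence classes treated as protein-altering (an assumed, simplified list).
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift", "splice_site"}

@dataclass
class Variant:
    gene: str
    consequence: str             # e.g. "missense", "synonymous"
    population_frequency: float  # allele frequency in a reference database

def candidate_variants(variants, disease_genes, max_frequency=0.001):
    """Keep variants that are rare, protein-altering and in a gene linked to the disease."""
    return [
        v for v in variants
        if v.population_frequency <= max_frequency
        and v.consequence in PROTEIN_ALTERING
        and v.gene in disease_genes
    ]

# Made-up variants with placeholder gene names.
variants = [
    Variant("GENE_A", "nonsense", 0.0),
    Variant("GENE_B", "missense", 0.02),       # too common
    Variant("GENE_A", "synonymous", 0.0001),   # no predicted protein change
    Variant("GENE_C", "frameshift", 0.0),      # gene not linked to this phenotype
]
candidates = candidate_variants(variants, disease_genes={"GENE_A", "GENE_B"})
for v in candidates:
    print(v)
```

The right frequency cutoff depends on the disease's prevalence and inheritance pattern, so a single fixed threshold like the one here is only a starting point.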
Making sense of one genome requires
tens of thousands of genomes
More than 500K exomes and 50K genomes have
been sequenced worldwide, but these data are
siloed by project and inconsistently processed
Exome Aggregation Consortium (ExAC)
Sample Size (N) and Ancestral Diversity
[Figure: bar chart of individuals with exome sequence data (axis 0 to 60,000) for 1000 Genomes, ESP and ExAC, split by ancestry (African, East Asian, European, Latino, Middle Eastern, Native American ancestry, South Asian, Diverse Asian, Other), shown alongside the world proportion scaled to ExAC]
exac.broadinstitute.org
Value of reference databases
• Provide variant frequency in a large
population (either healthy or “reference”,
i.e. a population sample)
• Provide frequency across multiple
human populations
• Allow us to assess how many variants we
see in a particular gene
• Provide an unbiased estimate of variant
penetrance
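Frequency in a reference database is just the number of alternate alleles observed divided by the number of alleles genotyped, and it is worth computing per population as well as overall. A minimal sketch with made-up counts (the populations are from the ExAC chart, the numbers are not):

```python
def allele_frequency(allele_count, allele_number):
    """Alternate alleles observed divided by total alleles genotyped."""
    return allele_count / allele_number if allele_number else 0.0

# Hypothetical counts for one variant: population -> (allele count, allele number).
counts_by_population = {
    "African": (0, 10_000),
    "East Asian": (45, 8_000),
    "European": (2, 60_000),
}

frequencies = {
    pop: allele_frequency(ac, an) for pop, (ac, an) in counts_by_population.items()
}
total_ac = sum(ac for ac, _ in counts_by_population.values())
total_an = sum(an for _, an in counts_by_population.values())
overall = allele_frequency(total_ac, total_an)
highest_pop = max(frequencies, key=frequencies.get)
print(f"overall: {overall:.5f}, highest in {highest_pop}: {frequencies[highest_pop]:.5f}")
```

In this toy case the variant looks rare overall but is nearly ten times more frequent in one population, which is the situation a single-population database would miss.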
Lessons from ExAC
• Many “healthy” people carry apparently
disease-causing variants
– over 20,000 reported disease variants are
seen in our “healthy” samples
– average ~2/person after filtering
• What’s causing this?
– carriers of recessive variants
– some undiagnosed disease cases
– lots of false positive variants (20-25%)
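The "carriers of recessive variants" explanation can be made concrete with a back-of-the-envelope calculation. Assuming Hardy-Weinberg proportions (an assumption added here, not a claim from the talk), a recessive disease allele at frequency q is carried by about 2q(1-q) of people but causes disease in only q². The allele frequency below is hypothetical; the sample size is roughly that of the ExAC chart.

```python
def expected_carriers(allele_frequency, n_individuals):
    """Expected heterozygous carriers under Hardy-Weinberg: 2 * p * q * N."""
    q = allele_frequency
    p = 1.0 - q
    return 2 * p * q * n_individuals

def expected_affected(allele_frequency, n_individuals):
    """Expected homozygotes under Hardy-Weinberg: q^2 * N."""
    return allele_frequency ** 2 * n_individuals

n = 60_000  # roughly the number of individuals on the ExAC chart axis
q = 0.01    # hypothetical allele frequency of a recessive disease variant
print(f"carriers: ~{expected_carriers(q, n):.0f}, affected: ~{expected_affected(q, n):.0f}")
```

Carriers outnumber affected individuals by roughly 200 to 1 at this frequency, so seeing a genuine recessive disease variant in "healthy" samples is expected rather than contradictory.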
Databases of disease mutations
• Drawn from literature collected over decades
with variable standards
• Five years ago: no large frequency databases,
so any rare protein-altering variant was assumed to be causal
• New databases more careful about evidence
xBrowse: Rapid exploration of multiple
inheritance patterns
https://atgu.mgh.harvard.edu/xbrowse/
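To illustrate what searching by inheritance pattern means, here is a minimal sketch of three common trio filters. It is not xBrowse's implementation: genotypes are coded as the number of alternate alleles (0, 1 or 2), and the function names, tuple layout and placeholder gene names are assumptions made for the example.

```python
def is_de_novo(child, mother, father):
    """Variant present in the child but absent from both parents."""
    return child > 0 and mother == 0 and father == 0

def is_homozygous_recessive(child, mother, father):
    """Child carries two copies, one inherited from each heterozygous parent."""
    return child == 2 and mother == 1 and father == 1

def compound_het_genes(variants):
    """Genes where the child has one het variant from the mother only
    and a different het variant from the father only."""
    from_mother, from_father = set(), set()
    for gene, child, mother, father in variants:
        if child == 1 and mother == 1 and father == 0:
            from_mother.add(gene)
        elif child == 1 and mother == 0 and father == 1:
            from_father.add(gene)
    return from_mother & from_father

# Made-up trio genotypes: (gene, child, mother, father).
trio_variants = [
    ("GENE_A", 1, 0, 0),  # de novo
    ("GENE_B", 2, 1, 1),  # homozygous recessive
    ("GENE_C", 1, 1, 0),  # maternal half of a compound het
    ("GENE_C", 1, 0, 1),  # paternal half of a compound het
    ("GENE_D", 1, 1, 0),  # inherited from the mother only: no pattern on its own
]
de_novo = [g for g, c, m, f in trio_variants if is_de_novo(c, m, f)]
recessive = [g for g, c, m, f in trio_variants if is_homozygous_recessive(c, m, f)]
compound = compound_het_genes(trio_variants)
print(de_novo, recessive, sorted(compound))
```

Each pattern is a separate hypothesis about how the disease is transmitted, so running all of them over the same family is what makes exploring multiple inheritance patterns quick.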
xBrowse: Filtering by function and frequency
xBrowse: Digestible information for all candidate
variants and genes
exac.broadinstitute.org
xBrowse: Following up candidate genes with
external resources
The big (largely) unsolved challenges
• NGS data still miss a non-trivial number
of genetic variants and also contain errors
• Our reference databases are still missing
many populations
• Uncertainty even about “known”
pathogenic variants in databases
• For many variants, penetrance is not
robustly established
• Huge difference between interpretation
in “healthy” and “disease” samples