Interpreting exomes and genomes: a beginner’s guide

Daniel MacArthur
Analytic and Translational Genetics Unit, Massachusetts General Hospital
Broad Institute of Harvard and MIT
www.macarthurlab.org
Twitter: @dgmacarthur

Overview
• Fundamentals of next-generation sequencing
• Genomes, exomes and targeted panels
• Genomic diagnosis: how do we filter causal variants from a patient’s entire genome?
• Major challenges for NGS diagnosis

Next-generation sequencing
• Many different technologies: Illumina, Pacific Biosciences, Oxford Nanopore
• DNA is chopped into fragments, and many fragments are read at the same time: massively parallel sequencing

Sequencing yields billions of reads per run
[Figure: a pile of unordered short reads, e.g. CGTTACGGCAGACG, TTTGAACTTTCATAG, GGGACATATTCGAAAT, ACGGGATGTACG, TAGACATAGACGACT]

Compare the reads to a reference genome
[Figure: short reads aligned beneath the reference sequence GTACTGACCAGTAGACATAGACGACTTTGACGGGATGTACGAGCCGTAGCTA]

NGS allows us to sample the sequence position many times over
[Figure: the same reference with overlapping reads stacked over each position; some reads read ...GGGATGTACGAG (reference C) and others ...GGGATGTATGAG (T)]

Challenges:
• Mapping short reads
• Variable coverage
• Base calling quality
• All of these tend to be worse for insertions and deletions than for SNPs
[Figure label: C -> T (5 C / 5 T)]

Which technology to choose?
Technology                 Percent of genome sequenced
Whole genome sequencing    >95%
Whole exome sequencing     ~1.5% (protein-coding regions)
Targeted sequencing        0.005% - 0.1% (100s – 1000s of genes)

[Figure: comparison of technologies; recovered labels: Cost, Depth of Coverage, Targeted sequencing]

The problem with exome data
• Clinically and genetically heterogeneous conditions
[Figure: screenshot labelled "x 30,000 rows"]

Sifting signal from noise in exomes
• Every genome contains many rare, potentially functional variants
  – ~500 rare missense variants
  – ~100 loss-of-function (LoF) variants: ~20 homozygous, ~20 rare
  – ~100 rare variants in known disease genes
  – 5-10 recessive disease-causing mutations
  – 1-2 de novo coding mutations
  – sequencing errors
• In Mendelian disease patients we need to find 1-2 true causal mutations amidst this “noise”

How do we find pathogenic variants?
1. Is the variant a known pathogenic variant? How much evidence supports the claim of pathogenicity?
2. Is the variant rare?
3. Is it predicted to have a functional impact (change a protein sequence)?
4. Does it segregate with disease?
5. Is the gene associated with the disease?

Making sense of one genome requires tens of thousands of genomes
• More than 500K exomes and 50K genomes have been sequenced worldwide, but these data are siloed by project and inconsistently processed

Exome Aggregation Consortium (ExAC)
exac.broadinstitute.org
[Figure: Sample Size (N) and Ancestral Diversity. Bar chart of individuals with exome sequence data (axis 0 to 60,000) for 1000 Genomes, ESP and ExAC, coloured by ancestry, next to a world population proportion bar labelled "Scaled to ExAC h" (label truncated in the source). Legend labels recovered: East Asian, Latino, South Asian, European, African, Middle Eastern, Native American ancestry, Diverse Asian, South East Asian, Other.]

Value of reference databases
• Provide variant frequency in a large population (either healthy, or “reference”, i.e. a population sample)
• Provide frequency across multiple human populations
• Allow us to assess how many variants we see in a particular gene
• Provide an unbiased estimate of variant penetrance

Lessons from ExAC
• Many “healthy” people carry apparently disease-causing variants
  – over 20,000 reported disease variants are seen in our “healthy” samples
  – average ~2/person after filtering
• What’s causing this?
  – carriers of recessive variants
  – some undiagnosed disease cases
  – lots of false positive variants (20-25%)

Databases of disease mutations
• Drawn from literature collected over decades with variable standards
• Five years ago: no large frequency databases, so any rare protein-altering variant was assumed to be causal
• Newer databases are more careful about evidence

xBrowse: Rapid exploration of multiple inheritance patterns
https://atgu.mgh.harvard.edu/xbrowse/

xBrowse: Filtering by function and frequency

xBrowse: Digestible information for all candidate variants and genes
exac.broadinstitute.org

xBrowse: Following up candidate genes with external resources

The big (largely) unsolved challenges
• NGS data still miss a non-trivial number of genetic variants, and also contain errors
• Our reference databases are still missing many populations
• Uncertainty remains even about “known” pathogenic variants in databases
• For many variants, penetrance is not robustly established
• Huge difference between interpretation in “healthy” and “disease” samples
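
Worked sketch: filtering candidate variants
The five questions under "How do we find pathogenic variants?" can be read as a filter followed by a ranking. The Python below is only a minimal illustration of that idea, not the xBrowse implementation. The Variant fields, the consequence labels, the 0.001 frequency cut-off and the gene names are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical threshold and consequence labels, for illustration only.
MAX_ALLELE_FREQ = 0.001
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift", "splice_site"}


@dataclass
class Variant:
    gene: str
    consequence: str        # predicted effect on the protein
    ref_allele_freq: float  # frequency in a reference database such as ExAC
    known_pathogenic: bool  # reported in a disease-mutation database
    segregates: bool        # seen in affected, absent in unaffected relatives


def prioritise(variants, disease_genes):
    """Keep rare, protein-altering variants that segregate with disease.

    Survivors are ranked so that variants in genes already associated with
    the disease come first, then previously reported pathogenic variants.
    """
    kept = [
        v for v in variants
        if v.ref_allele_freq <= MAX_ALLELE_FREQ   # question 2: rare?
        and v.consequence in PROTEIN_ALTERING     # question 3: functional?
        and v.segregates                          # question 4: segregates?
    ]
    return sorted(
        kept,
        # False sorts before True, so matches rise to the top.
        key=lambda v: (v.gene not in disease_genes, not v.known_pathogenic),
    )


candidates = [
    Variant("GENE_A", "missense", 0.00002, True, True),
    Variant("GENE_B", "missense", 0.04, True, True),        # too common
    Variant("GENE_C", "synonymous", 0.00001, False, True),  # no protein change
    Variant("GENE_D", "frameshift", 0.0, False, True),
    Variant("GENE_E", "nonsense", 0.0001, False, False),    # does not segregate
]

shortlist = prioritise(candidates, disease_genes={"GENE_A"})
for v in shortlist:
    print(v.gene, v.consequence, v.ref_allele_freq)
```

Note that GENE_B is dropped even though it is flagged as known pathogenic: in this sketch a reported variant that is common in the reference population is treated as a likely false positive, in line with the "Lessons from ExAC" slide. A real pipeline would review such variants rather than discard them silently.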