hqSNP, wgMLST and the WGS alphabet soup: what epidemiologists need to know
Martin Wiedmann
Cornell University
E-mail: [email protected]

Outline
• Review of genomes, genes, and evolution
• Use of sequence data to assess relatedness of organisms
• Data analysis approaches
• wgMLST and hqSNP
• Trees and how to interpret them

What is a SNP?
• Single Nucleotide Polymorphism (SNP)
    ATGTTCCTC    sequence
    ATGTTGCTC    reference
  *phylogenetically informative differences
• Insertion or Deletion (Indel)
    ATGTTCCCTC   sequence
    ATGTTC-CTC   reference
  *differences not used in hqSNP analysis

Microbial evolution 101 – mechanisms of change
Point mutations
    ACCCTCTAGTAGTAGCA
    ACCATCTAGTAGTAGCA
    ACCCTCTAGTAGTAGCA
1 SNP and one “genetic event”

Microbial evolution 101 – mechanisms of change
Insertion or deletion
    ACCCTCTAGTAGTAGCA
    ACCATCTAG . . . TAGCA
    ACCCTCTAGTAGTAGTAGCA
3 differences (?) and one “genetic event”

Microbial evolution 101 – mechanisms of change
Inversion
    ACCCTCTAGTAGTAGCA
    ACCATCTCGTAGTAGCA
    ACCCTCTAGTAGTAGCA
Alignment:
    ACCATCTCGTAGTAGCA
    ACCCTCTAGTAGTAGCA
2 SNPs and one “genetic event”

Microbial evolution 101 – mechanisms of change
Horizontal gene transfer of homologous gene sequences
    ACCCTCTAGTACTAGCATCC
    TCCCTCTTGTCCTACCATCA
    CTTGTCCTACCA
    CTTGTCCTACCA
    ACCCTCTAGTACTAGCATCC
    ACCCTCTTGTCCTACCATCC
Alignment:
    ACCCTCTAGTACTAGCATCC
    ACCCTCTTGTCCTACCATCC
3 SNPs and 1 genetic event

Microbial evolution 101 – mechanisms of change
[Figure labels: Transformation, Transduction]

Case study – why does it matter
• Human listeriosis outbreak in 2000 with 29 cases
• Isolates differ by 1 SNP from food and human isolates from a single case linked to processing facility X in 1988
• Epidemiology supports that this facility was the source of the outbreak
• Some analysis approaches that did not account for recombination would have shown that human isolates from 2000 show approx.
3,000 SNP differences to the 1988 food isolate from facility X
• Why: a large recombination event that introduced a large prophage (a virus inserted into the bacterial genome)

Outline
• Review of genomes, genes, and evolution
• Use of sequence data to assess relatedness of organisms
• Data analysis approaches
• kSNP, wgMLST, and hqSNP
• Trees and how to interpret them

Use of sequence data to assess relatedness of organisms
• Differences in sequences can be used to assess the relatedness of organisms and the likelihood of a recent common ancestor
  • “Do the M. tuberculosis isolates from patient A and patient B share a recent common ancestor?”
• The definition of “recent” becomes important: recent in years or in generation times
  • Salmonella in a dry processing plant may stay dormant and rarely if ever multiply (or imagine anthrax spores in soil)
  • Salmonella in a chicken flock may multiply every 30 min (>7,500 times a year)
• Assessing relationships of microbial isolates typically requires more information than just sequence data
  • Information on epidemiological relationships and other relevant data is essential

Outline
• Review of genomes, genes, and evolution
• Use of sequence data to assess relatedness of organisms
• Data analysis approaches
• kSNP, wgMLST, hqSNP, and others
• Trees and how to interpret them

Basics of WGS Analyses
• Different ways to compare the genomes of 2 different isolates
  • Compare the genomes small piece by small piece to find pieces that are different
    • k-mer based analyses
  • Use a high quality (reference) sequence or genome to identify differences
    • hqSNP analysis
  • Compare genomes on a gene-by-gene (locus-by-locus) basis
    • wgMLST analysis
• All these analyses can provide an output that gives the “number of differences” or can be used to build trees

Basics of WGS Analyses
• Different ways to compare the genomes of 2 different isolates
  • Compare the genomes small piece by small piece to find pieces that are different
    • k-mer based analyses
  • Use a high quality
(reference) sequence or genome to identify differences
    • hqSNP analysis
  • Compare genomes on a gene-by-gene (locus-by-locus) basis
    • wgMLST analysis
• All these analyses can provide an output that gives the “number of differences” or can be used to build trees

What makes a SNP high quality (hq)?
[Figure: sequence reads pass through a quality filter, leaving quality-filtered sequence reads ready for analysis]
Apply a quality filter that filters out nucleotides in sequence reads for comparison based on sequence coverage and quality

The alphabet soup of analysis – Coverage
[Figure: alphabet soup illustrating coverage at 40x and coverage at 5x; image source: http://missusrousselee.deviantart.com/art/AlphabetSoup-134724659]
• NGS generates 100,000 or more reads per genome sequenced
• Any single location on the genome can be covered by zero to hundreds of sequence reads

What to call a SNP
• SNPs are called based on:
  • Quality
  • Coverage
  • Base frequency
• The differences between the reference and the compared genome are extracted and used to determine relatedness
    ATGTTACTC
    ATGTTCCTC
    ATGTTCCTC
    ATGTTCCTC
    ATGTTCCTC
    ATGTTCCTC
    ATGTTTCTC
    ATGTTCCTC
    ATGTTCCTC
    ATGTTCCTC
    ATGTTGCTC
    ATGTTCCTC
    ATGTTCCTC
    ATGTTCCTC
    ATGTTGCTC   reference
Is it a SNP?

Where to call a SNP?
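As a rough sketch of both questions, what to call and where to call it, the read pileup from the “What to call a SNP” slide can be run through a toy caller. The thresholds (`min_coverage`, `min_frequency`) and the `masked` intervals are illustrative assumptions, not values from any particular pipeline, and per-base quality scores are left out because the slide gives none.

```python
from collections import Counter

# The 14 reads from the slide's pileup; only the base at index 5 varies.
READS = (
    ["ATGTTACTC"]
    + ["ATGTTCCTC"] * 5
    + ["ATGTTTCTC"]
    + ["ATGTTCCTC"] * 3
    + ["ATGTTGCTC"]
    + ["ATGTTCCTC"] * 3
)
REFERENCE = "ATGTTGCTC"


def call_snp(reads, reference, pos, min_coverage=10, min_frequency=0.75, masked=()):
    """Return the alternate base at `pos` if it passes all filters, else None.

    `min_coverage`, `min_frequency`, and `masked` (half-open intervals) are
    hypothetical settings chosen for illustration only.
    """
    # "Where": skip positions inside masked regions such as prophages.
    if any(start <= pos < end for start, end in masked):
        return None
    bases = Counter(read[pos] for read in reads)
    coverage = sum(bases.values())
    # "What": require enough reads covering the position ...
    if coverage < min_coverage:
        return None
    base, count = bases.most_common(1)[0]
    # ... and a clear majority base that differs from the reference.
    if base == reference[pos] or count / coverage < min_frequency:
        return None
    return base


snp_full_coverage = call_snp(READS, REFERENCE, 5)                 # 11 of 14 reads carry C
snp_low_coverage = call_snp(READS[:5], REFERENCE, 5)              # only 5 reads
snp_in_masked_region = call_snp(READS, REFERENCE, 5, masked=[(0, 9)])
```

With these assumed settings the position is called at full coverage, but not when only five reads cover it or when it falls in a masked region, which is one reason two pipelines can report different SNP counts for the same isolates.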
• Not all SNP pipelines are equal: where you call SNPs will affect the total SNP count
• SNPs relevant for phylogenetic analysis are vertically transmitted, not horizontally, so horizontally acquired genetic elements like phages can be masked
[Figure: raw reads mapped across genes and mobile elements. Mask mobile elements (do not consider SNPs in these locations); only call SNPs in genes]

How to report SNP data – keep it simple
• Hi folks:
  New Cluster: 2016039
  Two isolates are 0 SNPs from each other:
  E2017003216 (SE77B52)
  E2017003039 (SE77B52)
  New Cluster: 2016040
  Two isolates are 2 SNPs from each other:
  E2017002910 (SE1B1)
  I2017003132 (SE1B1)
MDH00841 MDH00849

Caveats of hqSNP analyses
Advantages | Disadvantages | When to Use
Phylogenetically informative (builds a tree consistent with the evolution of the strains) | Requires a closely related reference genome; hqSNP analysis is problematic if the reference genome is not closely related | Good for situations where a wgMLST database has not been developed and validated. May provide the highest resolution for strain comparison
SNP position can be identified on the genome (the affected gene can be identified) | Takes a while and requires a lot of computing power |
 | Interpretation of data depends on the genomes added; it is not stable and does not lead to nomenclature |

Basics of WGS Analyses
• Different ways to compare the genomes of 2 different isolates
  • Compare the genomes small piece by small piece to find pieces that are different
    • k-mer based analyses
  • Use a high quality (reference) sequence or genome to identify differences
    • hqSNP analysis
  • Compare genomes on a gene-by-gene (locus-by-locus) basis
    • wgMLST analysis
• All these analyses can provide an output that gives the “number of differences” or can be used to build trees

Traditional MLST
MLST.NET
• ~6-12 housekeeping genes; usually a portion of each gene
• Developed in the era of Sanger sequencing, providing improved discrimination over sequencing 1 gene
• Targets selected to represent population structure, not as useful for
outbreak detection
• Schemes are available in international, publicly accessible databases
• A combination of 6-12 genes is used to name a unique sequence type (e.g., MLST profile 1-1-1-1-1-1-1 = ST1)

Whole genome multilocus sequence typing (wgMLST)
• Database is built from gene content representing a diverse selection of the genus/species of the organism being compared
• Each unique gene is referred to as a “locus”; a locus may include the entire gene or a piece of the gene
• Any change (SNP, insertion, deletion) equals a new allele call for a locus
• New alleles are named sequentially when encountered, not based on sequence
    Locus 1   ACTAGAGGGAAA   allele 1
              ACTAGAGGCTAA   allele 2   (2 SNPs)
              ACT-GAGGGAAA   allele 3   (1 indel)

Whole genome multilocus sequence typing (wgMLST)
• Allows for simpler analysis and clear naming of subtypes
• Performs comparison on a gene-by-gene level

                       Isolate A   Isolate B   Isolate C
    Locus 1 (20 nt)        1           1           1
    Locus 2 (100 nt)       8           8          12
    Locus 3 (5000 nt)      5           5           2
    Etc.
    Locus 2,005 (5 nt)     4           4           4
    wgMLST type            A           A           B
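The two ideas above, sequential allele naming and gene-by-gene comparison, can be sketched in a few lines. This is a minimal illustration, not how any production wgMLST system is implemented: `assign_allele` and `allele_differences` are hypothetical helper names, and skipping loci without an allele call is an assumption (real schemes handle missing calls in different ways).

```python
def assign_allele(locus_db, sequence):
    """Return the allele number for `sequence` at one locus.

    Unseen sequences get the next number in order of discovery, so the number
    says nothing about how similar two alleles are.
    """
    if sequence not in locus_db:
        locus_db[sequence] = len(locus_db) + 1
    return locus_db[sequence]


def allele_differences(profile_a, profile_b):
    """Count loci whose allele numbers differ between two isolates.

    Assumption: loci with no call (None) in either isolate are ignored.
    """
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(
        1
        for locus in shared_loci
        if profile_a[locus] is not None
        and profile_b[locus] is not None
        and profile_a[locus] != profile_b[locus]
    )


# Locus 1 from the slide; the indel allele is stored without the gap character.
locus1_db = {}
alleles = [
    assign_allele(locus1_db, seq)
    for seq in ("ACTAGAGGGAAA", "ACTAGAGGCTAA", "ACTGAGGGAAA", "ACTAGAGGGAAA")
]

# Allele profiles copied from the table above.
profiles = {
    "Isolate A": {"Locus 1": 1, "Locus 2": 8, "Locus 3": 5, "Locus 2,005": 4},
    "Isolate B": {"Locus 1": 1, "Locus 2": 8, "Locus 3": 5, "Locus 2,005": 4},
    "Isolate C": {"Locus 1": 1, "Locus 2": 12, "Locus 3": 2, "Locus 2,005": 4},
}
a_vs_b = allele_differences(profiles["Isolate A"], profiles["Isolate B"])
a_vs_c = allele_differences(profiles["Isolate A"], profiles["Isolate C"])
```

Note that the 2-SNP allele and the 1-indel allele each count as exactly one difference at the locus, which is the sense in which wgMLST compares character data rather than nucleotides.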
The alphabet soup of analysis – wgMLST
• The allele calls at each locus are compared between isolates and differences are used to determine relatedness

“Allele Code” Pattern Naming in the Listeria Database
Pilot thresholds
• 10% = 300 alleles
• 5% = 150 alleles
• 2.5% = 75 alleles
• 1% = 30 alleles
• 0.5% = 15 alleles
• 0.25% = 7 alleles
Two isolates are the same:
  Patient 1: 4.1.1.5.2
  Patient 2: 4.1.1.5.2

The wgMLST “zip code”
• Two isolates are the same:
  • Patient 1: 1.4.1.1.5.2
  • Patient 2: 1.4.1.1.5.2
• Three isolates; patient 3 differs by 1 to 7 alleles from 1 and 2:
  • Patient 1: 1.4.1.1.5.2
  • Patient 2: 1.4.1.1.5.2
  • Patient 3: 1.4.1.1.5.4
• Four isolates; patient 4 differs by 8 to 15 alleles from the others:
  • Patient 1: 1.4.1.1.5.2
  • Patient 2: 1.4.1.1.5.2
  • Patient 3: 1.4.1.1.5.4
  • Patient 4: 1.4.1.1.7.1

How to report wgMLST data – keep it simple
• Hi folks:
  New Cluster: 2016039
  Two isolates are 0 alleles from each other:
  E2017003216 (SE77B52)
  E2017003039 (SE77B52)
  New Cluster: 2016040
  Two isolates are 2 alleles from each other:
  E2017002910 (SE1B1)
  I2017003132 (SE1B1)

How to report wgMLST data – give me the ZIP codes
• Looks like we may have a cluster
  • Patient 1: 1.4.1.1.5.2
  • Patient 2: 1.4.1.1.5.2
  • Patient 3: 1.4.1.1.5.4
  • Patient 4: 1.4.1.1.7.1
  • Patient 5: 1.4.3.3.1.1

MLST Analysis
• Faster than analyzing SNP differences
• For WGS data, allele calls can be performed on short reads (“assembly-free”) and on assembled genomes (“assembly-based”)
• If there is a conflict between the allele calls, then no allele call is made

Advantages and Caveats of wgMLST analysis
Advantages | Disadvantages | When to Use
Phylogenetically informative | Initial assignment of alleles is computationally costly (doing assemblies before calling alleles); CDC's system will call alleles directly from raw reads (~2 min); assemblies take about 2 h or perhaps longer; if there is a conflict between the allele calls then no allele call is made | Surveillance, especially for a distributed
testing network
All virulence, serotyping, and antibiotic resistance genes can be pulled out as part of the analysis | Comparing character data (allele numbers) rather than genetic data | Reference characterization
Neutralizes the effects of horizontal gene transfer (the event is only counted once rather than many times as for hqSNPs) | SNPs and indels are treated equally | Accurate cluster detection
Allele calling is stable: data are standardizable; directly comparable between laboratories; can lead to nomenclature based on allele calls, which can be used for communication and automated cluster detection; reproducibility does not depend on the choice of reference strain; amenable to automated bioinformatics | Requires curation of allele calls | Need to communicate with partners using stable nomenclature

hqSNP versus MLST Analysis
• Both analyses are conducted from the same raw data (typically short-read sequencing data)
• For public health purposes, both correlate well
  • i.e., the outermost branches of the phylogenetic trees are almost identical
• The two are not mutually exclusive
• For some use cases MLST works better; for others SNP works better

Interpreting analysis data – how to build trees using WGS analysis
• Use WGS analysis to infer relatedness of isolates
• For wgMLST: translate the number of allele differences between isolates into a measure of similarity and use that to infer branch lengths and relatedness
• For hqSNP analysis: translate nucleotide differences between isolates into relatedness
  • Can use substitution models to estimate the cost of changing from A>T, C>A, etc.
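To make “translate nucleotide differences into relatedness” concrete, here is a minimal sketch that builds a pairwise SNP distance table and applies the Jukes-Cantor (JC69) correction, the simplest substitution model, which assumes every change (A>T, C>A, etc.) is equally likely. The four sequences are the isolate sequences from the deck's tree-building example; the choice of JC69 and the function names are assumptions made for illustration, not the method of any specific pipeline.

```python
import math
from itertools import combinations

# Isolate sequences from the deck's "Building the tree" example.
ISOLATES = {
    "A": "ggagagtta",
    "B": "ggatccccc",
    "C": "ggattatta",
    "D": "actgccggt",
}


def snp_distance(seq_a, seq_b):
    """Number of positions that differ between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - (4/3) * p).

    p is the proportion of differing sites. The correction is undefined for
    p >= 0.75, where this sketch returns infinity.
    """
    p = snp_distance(seq_a, seq_b) / len(seq_a)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


# Pairwise raw SNP counts, keyed by isolate pair in alphabetical order.
distances = {
    (x, y): snp_distance(ISOLATES[x], ISOLATES[y])
    for x, y in combinations(sorted(ISOLATES), 2)
}
```

A distance table like this is the usual input for distance-based tree building. Isolates A and C come out closer (3 differences) than B and C (5), which is similarity, not necessarily relatedness.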
[Figure: substitutions among the four bases: thymine, cytosine, adenine, guanine]

How to report SNP data – trees
    1   ATATTCCGCAA
    2   ATATTCCGCAA
    3   ATATTGCGCAA
    4   ACCTTGCGCTA
[Figure: isolates 1 to 4 drawn as trees in several layouts]

Building the tree
    Isolate     Sequence
    A           ggagagtta
    B           ggatccccc
    C           ggattatta
    D           actgccggt
    ancestor    actgaatta
[Figure: tree relating isolates A, B, C, and D to the ancestor through the intermediate sequences ggagaatta and ggataatta; branch labels (6, 1, 1, 3, 1, 5) count genetic changes]
• Use the differences you identified by hqSNP or wgMLST to infer the relatedness or phylogeny

Reading the trees
[Figure: the example tree annotated with the terms node, most recent common ancestor (for isolates B and C), ancestral node, terminal node, leaf, taxa, and clade]
• Outgroup/Root: a related isolate (same PFGE pattern or 7-gene MLST) that is not part of the outbreak

Trees, branches, and leaves – more than one way to draw a tree
[Figure: trees with isolate tip labels (for example 2012K-1417, SRR2759138, N23600) drawn in several layouts]
• Branches that connect to the terminal node are the important branch lengths to indicate relatedness

Many different ways to display trees
[Figure: isolates shown as a tree and as a minimum spanning tree with numeric branch labels]
Trees, branches and leaves – reading the trees
• Difference between similarity and relatedness on the tree
  • Isolates A and C are more similar to each other than C and B are
  • Isolates C and B are more closely related to each other than C and A are
[Figure: the example tree with isolates A to D, the intermediate sequences, and branch labels counting genetic changes]

Trees, branches and leaves – what does it mean for my outbreak investigation
• Epidemiologic data provide context to the tree; a phylogenetic tree alone cannot be relied on to identify the outbreak source
[Figure: the example tree with tips labeled by isolate source: stool, stool, kale, and spinach]

wgMLST-based phylogenetic tree
• Minimum spanning tree (MST)
• Unrooted
• Depicts genomes in a network; branch lengths show the relatedness of isolates (number of allele differences)
[Figure: minimum spanning tree with nodes labeled 1307MNGX6-1, 1312MLGX6-1, and 1312MLGX6-1NOT; numeric branch labels range from 1.00 to 169.00; annotations mark “Crave Brothers”, “New subgroup”, and “kale”]

[Figure: SNP-based tree of MDH isolates (MDH00202 to MDH00256). Tip labels mark isolates as sporadic (with collection date) or note links such as the same PFGE pattern and time frame as an outbreak; outbreak clusters are annotated with SNP ranges of 0 SNPs, 1 SNP, 0-1 SNPs, 0-2 SNPs, and 0-3 SNPs]
Defined Outbreak Samples
• Outbreak 1 – Sept 2000
• Outbreak 2 – May 2001
• Outbreak 3 – Aug 2001
• Outbreak 4 – Nov 2003
• Outbreak 5 – Aug 2008
• Outbreak 6 – Spring 2014
• Outbreak 7 – Spring 2014
Taylor et al. J Clin Micro Oct 2015.

Take Home Messages
• Molecular epidemiology requires collaboration between epidemiologists and the lab
• Microbial isolates can accumulate genetic differences through a variety of mechanisms (e.g., horizontal gene transfer)
• How a data analysis approach deals, or does not deal, with these different evolutionary mechanisms can play an important role
• hqSNP and wgMLST both address and account for horizontal gene transfer, but in different ways
• Different organisms differ in their lifestyles and mechanisms of evolution
• Need to know your epi and your bugs

Acknowledgments
• Centers for Disease Control and Prevention
  • Heather Carleton
  • Greg Armstrong
  • Peter Gerner-Smidt
  • John Besser
• Integrated Food Safety Centers of Excellence

Questions