Journal of Medical Microbiology (2004), 53, 35–45 DOI 10.1099/jmm.0.05365-0 Identification and interrogation of highly informative single nucleotide polymorphism sets defined by bacterial multilocus sequence typing databases Gail A. Robertson,1 Venugopal Thiruvenkataswamy,1 Hayden Shilling,2 Erin P. Price,1 Flavia Huygens,1 Frans A. Henskens2 and Philip M. Giffard1 Correspondence Philip M. Giffard 1 Cooperative Research Centre for Diagnostics, Queensland University of Technology (Gardens Point Campus), GPO Box 2434 Brisbane, Queensland 4001, Australia [email protected] 2 Discipline of Computer Science and Software Engineering, University of Newcastle, Newcastle, New South Wales, Australia Received 26 June 2003 Accepted 11 September 2003 A unified, bioinformatics-driven, single nucleotide polymorphism (SNP)-based approach to microbial genotyping has been developed. Multilocus sequence typing (MLST) databases consist of known variants of standardized housekeeping genes. Normally, seven fragments are defined; a sequence type (ST) consists of the variants of these fragments that are found in a particular isolate. A computer program that can identify highly informative sets of SNPs in entire MLST databases has been constructed. The SNPs either define a particular user-specified ST or provide a high value for Simpson’s index of diversity (D), and may thus be generally applicable to that species. SNP sets that are diagnostic for Neisseria meningitidis ST-11 and ST-42, and high-D SNP sets for N. meningitidis and Staphylococcus aureus, were identified and real-time PCR methods to interrogate these SNPs were demonstrated. High-D SNP sets were also identified in other MLST databases. This widely applicable approach allows rapid genetic fingerprinting of infectious agents. INTRODUCTION The last decade has seen a great increase in quantity of gene sequence data that are contained in searchable and downloadable databases. These data are now acquiring a significant second dimension, in that there has been a recent rapid accumulation of data concerning the intraspecific variation of genes. Important examples of this are the single nucleotide polymorphism (SNP) databases for humans (Sachidanandam et al., 2001) and the multilocus sequence typing (MLST) databases for infectious micro-organisms (Maiden et al., 1998). MLST databases consist of known variants of standardized housekeeping genes. Normally, seven fragments are defined; a sequence type (ST) consists of the variants of these fragments that are found in a particular isolate. Known intraspecific variation can be interrogated to generate genetic fingerprints. For genetic fingerprinting procedures to be useful, their resolving power must be known. In the case of human SNPs, low average linkage between SNPs and complete allelic reassortment at each generation define Abbreviations: CT , cycles to threshold; ˜CT , difference in cycles to threshold; D, Simpson’s index of diversity; MLST, multilocus sequence typing; MRSA, methicillin-resistant Staphylococcus aureus; SNP, single nucleotide polymorphism; ST, sequence type. the algorithms used to deduce the resolving power of SNP combinations (Krawczak, 1999). Bacteria differ in that there is usually a single haploid chromosome, recombination occurs sporadically and at different rates in different species, a given recombination event will probably not encompass the entire genome, and mutation and recombination may occur at comparable rates (Feil et al., 2001). As a consequence, in some bacterial species, a degree of linkage may exist over an appreciable fraction of the genome and observed de-linkage may either be due to recombination or the appearance of new alleles as a result of point mutation. This prevents estimation of the resolving power of bacterial SNPs by using methods that were developed for sexually reproducing diploid organisms. The aim of this study was to develop a straightforward approach to SNP-based bacterial typing. We have made use of computer software that can identify highly informative sets of SNPs in databases of gene sequence variants and applied this to MLST databases for a number of bacterial pathogens. The algorithms used are free of assumptions concerning modes of generation of diversity. We have also developed kinetic PCR assays for interrogation of informative sets of SNPs that were identified in the Neisseria meningitidis and Staphylococcus aureus MLST databases. Kinetic PCR does not require any fluorescently labelled probes and is potentially very cost-effective. It is similar in Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 28 Jul 2017 16:57:08 05365 & 2004 SGM Printed in Great Britain 35 G. A. Robertson and others principle to the allele-specific PCR/amplification refractory mutation system family of SNP interrogation methods, which depend on reduced extension from 39 mismatched PCR primers (Germer et al., 2000). In the non-time-resolved format, these methods are difficult to optimize, as any extension from a mismatched primer provides a template for unimpeded amplification (Giffard et al., 2001). However, in the real-time format, optimization is easier as SNP calling can be done on the basis of a difference in the number of cycles to threshold (˜CT ) between allele-specific reactions, rather than a difference between the amounts of amplified material at the reaction end points. METHODS Bacterial strains. Seven N. meningitidis isolates were used (Table 1); these were obtained from Queensland Health Scientific Services. The STs of these isolates were determined as part of this study. The 11 S. aureus isolates that were used were all methicillin-resistant S. aureus (MRSA) from Australian healthcare facilities (Table 1). Isolates 66460/ 98 (ST-30), F829549 (ST-88), IP01M2046 (ST-88) and MC8011535 (ST-88) were from the Queensland Health Pathology Service collection and their STs were determined as part of this study. The remaining isolates (two with ST-30, three with ST-239 and two with ST-1) were from the Australian Group on Antimicrobial Resistance collection. The STs of these were determined as part of a study that is as yet unpublished. Downloading of MLST databases. All MLST data were downloaded from http://www.mlst.net in the form of alignments in FASTA format. The first 2510 STs were used for all analyses of N. meningitidis STs unless otherwise indicated, whereas for S. aureus, the following STs were available for download and were used for all analyses unless otherwise indicated: 1–103, 109, 120, 121, 123, 134, 145, 169, 178, 182, 188–190, 192–205, 207–215, 217, 220–222, 225, 228, 231, 235, 238–241, 243, 246–247 and 250. Identification of informative SNP sets. SNP sets were identified by using a computer program called ‘Minimum SNPs’. It is designed to provide a suite of methods for extraction of informative SNP sets from comparative sequence data. The functions of Minimum SNPs can be summarized as follows: (1) Minimum SNPs can either carry out SNP searches on alignments derived from individual MLST loci or convert an MLST database into a single concatenated alignment that can be searched in a single operation. (2) Highly informative sets of SNPs are identified by an approximation that we have termed the ‘anchored method’. This entails initial identification of the most informative SNP, which is termed SNP1, then identification of the next SNP (SNP2) that, in combination with SNP1, forms the most informative pair. This process is iterated, creating progressively larger combinations until a preset target, based on either informative power or number of SNPs, is reached. If two or more sets of SNPs exhibit equal informative power, then the program will provide all these sets as output. Given that this may occur at more than one point in the analysis, large numbers of different sets of SNPs may be produced. Cumulative informative power for each SNP identified is included in the output. (3) The minimum number of loci required to define any given ST can be calculated. (4) The informative power of sets of SNPs can be calculated by two different methods, depending on the user’s requirements. The ‘specified variant’ algorithm identifies SNPs that are diagnostic for a userspecified variant, i.e. one sequence in the alignment. Such SNPs are most useful for efficient determination of whether or not an unknown isolate is a variant of particular interest. If this option is selected, the program assesses informative power by calculating the proportion of non-selected variants that have the same polymorphs as the selected variant at the SNPs being assessed (the term ‘polymorph’ is used to denote a particular state of an SNP; this is to avoid confusion with the term ‘allele’, which can be used to denote variants of MLST loci). The ‘generalized’ algorithm identifies sets of SNPs that are suitable for the efficient determination of whether two isolates are likely to be the same or different. If this option is selected, the informative power of sets of SNPs is assessed by calculating Simpson’s index of diversity (D) (Simpson, 1949; Hunter & Gaston, 1988). In this context, this is the probability that two sequences in the alignment, selected at random without replacement, will type differently if interrogated at the SNPs in question. This is calculated as follows: Table 1. Bacterial isolates used in this study Species/ST Allelic profile N. meningitidis: ST-8 2, 3, 7, 2, 8, 5, 2 ST-11 2, 3, 4, 3, 8, 4, 6 ST-23 ST-32 ST-42 S. aureus: ST-1 ST-30, -88 ST-239 ST-? 36 No. isolates 1 3 10, 5, 18, 9, 11, 9, 17 4, 10, 5, 4, 6, 3, 8 10, 6, 9, 5, 9, 6, 9 1 1 1 1, 1 ,1, 1, 1, 1, 1 2, 2, 2, 2, 6, 3, 2 22, 1, 14, 23, 12, 4 2, 3, 1, 1, 4, 4, 3 2 3 3 3 Strain 98M3462 94M1541 02M5007, 97M2363 02M5044 99M2717 94M2114 – – – – Serological phenotype B : 4: P1.7, 4 C : 2a : P1.5, 2 C : 2a : P1.5 Y : NT: P1.5 B :1 5 : P1, 7 C : NT : P1, 2 – – – – Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 28 Jul 2017 16:57:08 Journal of Medical Microbiology 53 Bacterial typing based on computer-identified SNP sets D ¼ 1 1=[n(n 1)] s X Integrity of kinetic PCRs was checked routinely by determining melt curves of the products, to ensure that melting-points were consistent with the size of the expected product and not with primer-dimers, and also to ensure that a single peak was produced. n j (n j 1) J¼1 where n is the number of sequences in the alignment, s is the number of classes defined by the SNPs under test and nj is the number of sequences in the jth class. (5) The program allows the performance of sets of SNPs to be explored fully by incorporating a function that returns STs that conform to any combination of SNPs and polymorphs defined by the user. The program is menu-driven, with a comprehensive graphical user interface. It was written by using the Java programming language (http://java.sun.com). Its engineering is in accordance with the objectoriented methodology (Booch, 1993). This structure facilitates future adaptation of the software to implement, for example, a web-based interface that would allow clients to download a relatively small applet that communicates with a remote-server-based compute and storage engine. The decision to use Java rather than, for instance, C++ as the implementation language was predicated on a desire for maximum portability, as the program files generated by the Java compiler on one platform do not require recompilation to run on any alternative platform for which there is an available Java Virtual Machine. Thus, while the software is currently in use on Microsoft Windows-based PCs, it will also run on systems such as Solaris, Linux or Macintosh OS-X. It is recognized that programs written in Java execute more slowly than those written in languages that compile to native code for a given architecture (e.g. C++), but the benefits of portability were felt to far outweigh the slight performance penalty that is imposed by execution in a virtual environment. Moreover, the data structures used to store SNP data and the algorithms used to manipulate those data were chosen carefully, with maximum efficiency as a primary goal. Interrogation of SNPs by kinetic PCR. All reactions were carried out in an Applied Biosystems ABI 7000 Sequence Detection system by using Applied Biosystems SYBR Green PCR MasterMix. Primer design was carried out with the assistance of Primer Express 2.0 (Applied Biosystems) on sequences that were aligned by using Clustal X 1.8 (Jeanmougin et al., 1998) [accessed through the Australian National Genome Information Service (ANGIS), Sydney, Australia]. Primer sequences are listed in Table 2. Primers were obtained from Proligo and were unlabelled and purified by desalting only. For N. meningitidis, cells were grown on chocolate agar that was made as follows: 36 g GC agar base (Oxoid), 10 g haemoglobin powder (Oxoid), 0.4 of one vial of Vitox (Oxoid) and water to 1000 ml. To prepare genomic DNA, individual colonies were suspended in 400 mM Tris/ EDTA buffer (10 mM Tris, 1 mM EDTA, pH 8.0) and boiled for 6 min, then centrifuged at 13 400 g for 5 min to pellet debris. Aliquots of the supernatant were transferred to fresh microfuge tubes and stored at 82 8C. Reactions contained 1 l extract, 2.5–5 pmol each primer and 13 SYBR Green PCR MasterMix in a final volume of 20 l. A two-step temperature cycling protocol was used: 50 8C for 2 min and 95 8C for 10 min, followed by 40 cycles of 95 8C for 15 s and 59 8C for 30 s. For S. aureus, genomic DNA was prepared from 1 ml aerated overnight culture in brain heart infusion broth (Oxoid) by using a Qiagen DNA Extraction kit according to the manufacturer’s instructions, with the exception that to lyse the cells, cell pellets were incubated for 30 min at 37 8C in 180 l lysostaphin (200 g ml1 ). DNA was stored at 20 8C. Reactions contained 2 l template DNA, a final concentration of 0.53 SYBR Green PCR MasterMix and 6.25 pmol each primer and were made up to final volume of 25 l. A three-step temperature cycling procedure was used: 50 8C for 2 min and 95 8C for 10 min, followed by 40 cycles of 95 8C for 15 s, 56 8C for 10 s and 72 8C for 33 s. http://jmm.sgmjournals.org ˜CT values were calculated by subtracting the CT of the matched primer/template pair from the CT of the mismatched primer template. CT values were obtained by using the default threshold setting. MLST determination. Genomic DNA was prepared as for kinetic PCR reactions. Amplification and sequencing primers and amplification conditions were as recommended at http://www.mlst.net/, with the exception that, for the N. meningitidis fumC and pgm loci, the recommended sequencing primers were used for amplification as, in our hands, the recommended amplification primers were unreliable in yielding product. Sequencing reactions were carried out by using ABI PRISM BigDye Terminator mix (version 3.1) according to the manufacturer’s instructions. Sequencing products were electrophoresed at the Australian Genome Research Facility (University of Queensland, Brisbane, Australia). RESULTS AND DISCUSSION Experimental strategy The strategy pursued involved initial studies that used noncomputer-aided sequence analysis, which confirmed that informative SNP sets, composed of small numbers of SNPs, are indeed defined by MLST databases (data not shown). This was followed by software engineering, with incremental improvements to the program over a period of time. The software was used to test different data-mining strategies and, finally, in order to demonstrate the potential of this genotyping approach, we reduced to practice procedures for interrogation of potentially useful examples of SNP sets. Most of our focus was on N. meningitidis and S. aureus, as both of these species are epidemiologically interesting, but they differ greatly with respect to diversity in MLST loci. Identification of informative SNPs in the N. meningitidis MLST database SNPs that were diagnostic for N. meningitidis ST-11 and ST-42 were identified. ST-11 was chosen as this is the ST of the electrophoretic type-15 variant that has been responsible for outbreaks in the Czech Republic, Greece and Canada and is also isolated frequently in Australia (Tribe et al., 2002). ST42 is a member of lineage 3, which is a genetically related cluster of disease-causing strains (Caugant et al., 1990). Two methods were used to identify SNP sets. Firstly, a semiempirical, two-step strategy was adopted, which used an earlier version of Minimum SNPs that was unable to accommodate concatenated databases. This entailed identification of subsets of loci that defined these STs with a high degree of reliability, identification of SNPs that are diagnostic for alleles at these loci and use of these SNPs as a pool for empirical searches for highly discriminatory subsets. Discriminatory powers of these subsets were assessed by using the Minimum SNPs facility that returns STs that conform to Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 28 Jul 2017 16:57:08 37 G. A. Robertson and others Table 2. Kinetic PCR primers Bases that provide allele specificity are underlined. SNP Primer name N. meningitidis ST-specific SNPs: ST-11-specific: fumC435 fumC435-T* fumC435-C fumC435-Rev† pdhC12 pdhC12-T pdhC12-C pdhC12-For ST-42-specific: abcZ411 abcZ411-T abcZ411-C abcZ411-For aroE455 aroE455-A aroE455-G aroE455-For fumC201 fumC201-A fumC201-G fumC201-Rev pdhC274 pdhC274-C pdhC274-T pdhC274-For N. meningitidis high-D SNPs: pgm93 pgm93-A pgm93-C pgm93-G pgm93-Rev aroE283 aroE283-A aroE283-C aroE283-G aroE283A-T{ aroE283G-T aroE283-Rev fumC114 fumC114-C fumC114-T fumC114-For abcZ183 abcZ183-T abcZ183-C abcZ183-G abcZ183-For abcZ54 abcZ54-C abcZ54-T abcZ54-Rev gdh60 gdh60-A gdh60-G gdh60-Rev pdhC103 pdhC103-C pdhC103-T pdhC103-For S. aureus high-D SNPs: arcC210 arcC210-C arcC210-T arcC210-A arcC210-For 38 Primer sequence (59!39) ACCATTCCCTGATGCTGGTTACT CCATTCCCTGATGCTGGTTACC CAGCAAGCCCAACTCAACG CCTTTCAAGATGTCTTGTTCCGCA CTTTCAAGATGTCTTGTTCTGCG CGTGTTCTACTACATCACCCTGATG CAAGTTCGACAATCCGCGTA CGAGTTCGACAATCCGCGTG CTTGGTCGTCATTACCCACGA TGTATTCGATAACAGGGCGGATATT TGTATTCGATAACGGGGCGGATATC TGGGTATGCTGGTCGGTCA CGACCCAATGCGAAGCA CGACCCAATGCGAAGCG GTAACGTCGTTGCCGAACACT GGACCGTCATGACCTTGCAG GGACCGTCATGACCTTGCAA GAACGCTTCAACCGCCTG CCGCAATCCTAAAGCCAAAGTA CCCGGCGCGAAAGTC CCGCAATCCTAAAGCAAAAGTG CCGCCGTGTTCTTTAATCCA GGTCAGATTCCCGGTATTCCA CAGCTTCCTGCCGTCAGC GGTCAGATTCCCGATATTCCG AGCTTCCGGCCGTCAAT GTCAGCTTCCTGCCGTCAGT CCGTACACCATATCGTAGGCAAG TTCGCCCAAACCGCAG TTCGCCCAAACCGCAA AATCGCCAACGACATCCG GTTTTCTGGCAAACCAAGTTCA CCGGCAAACCGAGTTCG CCGGCAAACCGAGTTCC GAAGCGAAGGACGGCTGG GCGATTTATTGCGCCGTTAC GCGATTTATTGCGCCGTTAT GCCGTCCTTCGCTTCGAT CAGCTGACCATCGCCGAA CAGCTGACCATCGCCGAG TTTGCACCATATCGCGCA CCGGCAATGACTTCTTGCAG CCGGCAATCACTTCTTGCAA AAAGGTATGTACCTGCTGAAAGCC CGTATAAAAAGGACCAATTGGTTTG CGTATAAAAAGGACCAATTGGTTTA CGTATAAAAAGGACCAATTGGTTTT TATGATAGGCTATTGGTTGGAAACTG Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 28 Jul 2017 16:57:08 Journal of Medical Microbiology 53 Bacterial typing based on computer-identified SNP sets Table 2. cont. SNP tpi243 arcC162 tpi241 yqiL333 aroE132 gmk129 Primer name tpi243-A tpi243-G tpi243-Rev arcC162-T arcC162-A arcC162-Rev tpi241-G tpi241-A tpi241-Rev yqiL333-C yqiL333-T yqiL333-Rev aroE132-A aroE132-G aroE132-Rev gmk129-C gmk129-T gmk129-Rev Primer sequence (59!39) GTAAATCATCAACATCTGAAGATGCA GTAAATCATCAACATCTGAAGATGCG CTTCTTTGCTTGATAAGTCAGCAATAG GTGATAGAACTGTAGGCACAATCGTT GTGATAGAACTGTAGGCACAATCGTA GGGTTATTGAATCGTGGATCATC GGTAAATCATCAACATCTGAAGATG GGTAAATCATCAACATCTGAAGATA CTTCTTTGCTTGATAAGTCAGCAATAG TGCTTGTCAACAACAGTCGCTTC TGCTTGTCAACAACAGTCGCTTT TCTGTTAAACCATCATATACCATGCTATC GGCTTTAATATCACAATTCCTCATAAAGAA GGCTTTAATATCACAATTCCTCATAAAGAG CTTGTCATCTTTTATCAAAACAGTGTTAAC GGATGCGTTTGAAGCTTTAATC GGATGCGTTTGAAGCTTTAATT TTGTATCTTTAACATATTGAACTGGTGTAC *Allele-specific primers are named according to the position of the SNP and the base in the sense strand at the SNP. The 39 end of these primers always coincides with the SNP. †Common primers are named according to the position of the SNP and whether the primer is oriented as for the antisense strand (-Rev) or the sense strand (-For). {Primers aroE283A-T and aroE283G-T were used together in the T-specific reactions. Degenerate positions in these primers are underlined. any user-defined combination of SNPs and polymorphs. Secondly, SNPs that were diagnostic for these STs were identified in a single step by using the concatenated database. The results of the semi-empirical method were as follows: for ST-11, T at fumC435 and C at pdhC12 excluded 98.1 % of known STs, whereas for ST-42, T at abcZ411, A at aroE455, A at fumC201 and T at pdhC274 excluded 97.7 % of known STs. It was concluded that significant resolution may be obtained from a small number of SNPs. The results of the direct approach that used the concatenated database were broadly similar. With this method, it was possible to identify sets of SNPs that would, within the context of known STs, identify these STs unambiguously. For ST-11, 26 SNPs were required. These are listed, with the polymorph present in ST-11 and the cumulative percentage of SNPs discriminated from ST-11, as follows: pgm124, A (95.2 %); pdhC12, C (98.0 %); fumC435, T (98.4 %); gdh132, T (98.7 %); abcZ27, T (98.8 %); adk108, C (99.0 %); aroE352, A (99.2 %); gdh60, G (99.2 %); abcZ366, C (99.3 %); abcZ375, G (99.3 %); adk29, G (99.4 %); adk135, A (99.4 %); adk189, C (99.4 %); adk371, A (99.5 %); aroE43, C (99.5 %); aroE126, C (99.6 %); aroE169, A (99.6 %); aroE207, C (99.6 %); gdh290, G (99.7 %); gdh339, T (99.7 %); pdhC201, C (99.8 %); pgm106, A (99.8 %); pgm276, C (99.8 %); pgm373, G (99.9 %); pgm430, G http://jmm.sgmjournals.org (99.9 %); pgm433, G (100.0 %). For ST-42, only eight SNPs were required for unambiguous identification: abcZ411, T (88.4 %); gdh129, T (95.6 %); abcZ423, C (98.9 %); aroE82, T (99.5 %); fumC9, G (99.8 %); pdhC129, A (99.9 %); adk21, T (99.9 %); gdh492, C (100.0 %). This demonstrates that it is technically feasible to identify STs unambiguously by using SNPs, assuming that the unknown sample has a known ST. However, it is striking that for ST-11, 20 SNPs are required to move from 99 to 100 % discrimination, and that 95.2 % discrimination is obtained with just one SNP. For many applications, interrogation of a very small number of SNPs may provide useful information. SNPs that are discriminatory for specified STs are potentially very useful, but only for specialized applications. In order to define a set of SNPs that could be used to derive a discriminatory fingerprint from any N. meningitidis isolate, the Minimum SNPs program was used with the D option selected as the criterion for assessment of the discriminatory power of SNP sets. A set of seven SNPs was identified and their cumulative D-values were: pgm93, 0.653; aroE283, 0.871; pdhC456, 0.933; fumC114, 0.965; abcZ54, 0.980; abcZ183, 0.988; fumC330, 0.992. This demonstrates that a small number of SNPs can provide potentially useful resolution. Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 28 Jul 2017 16:57:08 39 G. A. Robertson and others Identification of informative SNPs in the S. aureus MLST database The S. aureus MLST database is much smaller than the N. meningitidis database, and the level of sequence diversity is much less. Therefore, it was of interest to explore the relationship between the number of SNPs identified and resolving power for this species. Firstly, SNPs that were diagnostic for a subset of the clonal complex progenitors and one single-locus variant, as defined by Enright et al. (2002), were identified. Clonal complex progenitors were tested because we conjectured that these may be difficult to define unambiguously, as they would need to be discriminated individually from each descendant. The STs chosen were the clonal complex progenitors ST-30, ST-22 and ST-45, and the ST-30 single-locus variant ST-36. SNPs that identify these STs unambiguously are as follows. For ST-30, 14 SNPs were required: glpF66, T (83.7 %); tpi241, G (88.3 %); arcC78, A (90.9 %); pta387, G (92.8 %); pta181, G (94.1 %); arcC165, A (94.8 %); aroE121, C (95.4 %); aroE310, G (96.1 %); aroE362, C (96.7 %); glpF124, G (97.4 %); gmk331, C (98.0 %); gmk390, A (98.7 %); pta115, A (99.3 %); tpi158, C (100.0 %). For ST-22, seven SNPs were required: tpi13, A (96.1 %); aroE184, G (96.7 %); glpF128, G (97.4 %); glpF213, C (98.0 %); gmk303, A (98.7 %); gmk340, G (99.3 %); gmk373, A (100 %). For ST-45, six SNPs were required: glpF57, A (95.4 %); aroE178, G (96.7 %); tpi60, T (98.0 %); aroE270, T (98.7 %); gmk6, A (99.3 %); pta357, T (100 %). For ST-36, two SNPs were required: pta181, A (99.3 %); glpF178, G (100 %). As expected, the single-locus variant required markedly fewer SNPs for unambiguous identification than the clonal complex progenitors. In addition, during these searches, the program settings were adjusted so that up to 10 SNP sets would be selected for each ST if they were comparable in performance. For the clonal complex progenitor STs, 10 essentially equivalent pathways to 100 % were identified. However, in the case of the single-locus variant ST-36, the set of two SNPs was the only set identified. These results are consistent with the notion that in the case of clonal complex progenitors, each descendant must be discriminated separately, whereas single-locus variants need only be discriminated from the progenitor. In order to obtain a set of SNPs with generalized ability to discriminate S. aureus STs, a set of SNPs that provides a high D-value was identified. These seven SNPs and their cumulative D-values are: arcC210, 0.51; tpi243, 0.75; arcC162, 0.84; gmk129, 0.90; aroE132, 0.92; yqi333, 0.94; tpi241, 0.95. Several alternatives to this set of SNPs with essentially equivalent performance were identified. It is clear that, although a potentially useful level of discrimination was achieved, the D-values did not rise as swiftly as with the N. meningitidis database. 40 Kinetic PCR interrogation of N. meningitidis SNPs In order to test the practicality and robustness of kinetic PCR interrogation of SNPs in N. meningitidis DNA, kinetic PCR assays were developed for: (a) SNPs diagnostic for ST-11 and identified by the semi-empirical two-step method (fumC435 and pdhC12); (b) SNPs diagnostic for ST-42 and identified by the semi-empirical two-step method (abcZ411, aroE455, fumC201 and pdhC274); (c) the high-D SNP set (pgm93, 0.653; aroE283, 0.871; fumC114, 0.933; abcZ183, 0.963; abcZ54, 0.979; gdh60, 0.987; pdhC103, 0.992). This is slightly different from the N. meningitidis high-D SNP set identified above, as we were unable to reduce to practice a reliable kinetic PCR assay for pdhC456. Therefore, this position was deleted from the alignment and the Minimum SNPs analysis was repeated, yielding the SNP set above. After optimization, all assays were carried out at least twice on each isolate. To assess the robustness of the assays, results from each polymorph of each SNP were pooled and the mean and SD of ˜CT values were calculated. The results are shown in Tables 3–5. On all occasions, the assays clearly indicated the correct polymorph. In the case of isolates 02M5007 (ST11) and 02M5044 (ST-23), MLST determination was not carried out until the kinetic PCR assays had been completed, so the SNP interrogation was carried out blind. The kinetic PCR and MLST results were entirely consistent. As expected, the high-D SNP set provided unique polymorph profiles for each ST included in the study, despite there being no reference to these STs in the selection process for these SNPs. It was concluded that this method has potential as a rapid means for obtaining an epidemiological type for N. meningitidis. An alternative strategy for reducing the time and cost of MLST-based genotyping has been reported by Shlush et al. (2002); this involved detection of mismatches between PCR products from a standard isolate and a known isolate by using denaturing HPLC. This approach is very effective for determining whether or not an isolate has an ST of interest. However, it requires post-PCR processing and, unlike our high-D SNP sets, cannot yield a generally applicable genetic fingerprint. During the process of primer design and optimization, a potentially useful observation was made. Scrutiny of the results of kinetic PCR interrogation of fumC435 revealed that in some fumC alleles, there was a subterminal mismatch 7 bp from the 39 end of the primer. This mismatch is a consequence of diversity in the primer-binding site and is present in ST-8 and ST-42, but not in ST-11 or ST-32. This mismatch was shown to improve allele specificity by increasing CT for reactions with mismatched primers, whilst having little or no effect on CT values for matched primers (Fig. 1). The net effect of this is to increase the ˜CT . The likely reason for this effect is that the mismatch lowers the melting-point of the target–primer duplex, thus reducing the probability that the primer site will be occupied at any given time-point during the annealing step. Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 28 Jul 2017 16:57:08 Journal of Medical Microbiology 53 Bacterial typing based on computer-identified SNP sets Table 3. Kinetic PCR results on N. meningitidis DNA for ST-specific SNPs All ˜CT values obtained were in the expected orientation. NA, Not applicable. ST-11-specific SNP set fumC435 ST-42-specific SNP set pdhC12 abcZ411 ST SNP profiles: ST-11 T C C ST-8 C T C ST-32 C T C ST-42 C T T ST-23 C T T Mean SD CT values from pooled results for each polymorph: G NA NA NA A NA NA NA T 7.86 1.06 10.87 0.45 7.52 2.28 C 7.17 3.18 15.74 3.7 13.87 0.79 aroE455 fumC201 pdhC274 G G G A G G G A A A T T C T T 7.34 2.97 4.13 0.79 16.74 1.97 9.91 1.26 NA NA NA NA NA 14.09 2.25 10.72 0.95 NA Table 4. Kinetic PCR results on N. meningitidis DNA for high-D SNPs All ˜CT values obtained were in the expected orientation. NA, Not applicable; NT, not tested. pgm93 aroE283 fumC114 ST SNP profiles: ST-11 G G C ST-8 C T C ST-32 G G T ST-42 A T T ST-23 G C T Mean SD ˜CT values from pooled results for each polymorph: G 14.05 2.32 10.15 1.64 NA . . A 15 55 3 85 NT NA T NA 12.14 3.83 9.18 3.27 C 18.29 4.38 4.24 1.07 18.06 1.96 abcZ183 abcZ54 gdh60 pdhC103 G G G C C C C T C C G G G G A C C C T T 15.65 2.03 NA NA NA NA 13.24 2.44 10.99 NT 7.62 0.01 6.11 1.03 NA 13.15 1.35 13.77 3.35 13.26 2.65 NA NA Table 5. ˜CT values from interrogation of high-D SNPs in ST-11 DNA ST-11 ˜CT Matched ( SD) Mismatched ( SD) NT, pgm93 aroE283 fumC114 abcZ183 abcZ54 gdh60 pdhC103 17.45 1.76 35.66 3.59 18.15 3.19 35.14 5.88 20.82 1.38 39.79 0.54 17.05 0.69 35.34 3.75 17.08 1.18 24.97 4.31 19.85 1.06 34.15 4.47 15.93 0.74 30.38 5.41 Not tested. Kinetic PCR interrogation of S. aureus SNPs Kinetic PCR assays were developed for the S. aureus high-D SNP set: arcC210, tpi243, arcC162, tpi241, yqiL333, aroE132 and gmk129. The results are shown in http://jmm.sgmjournals.org Tables 6 and 7. As with the N. meningitidis SNPs, the assay was very reliable. After optimization, ˜CT values obtained were always in the expected orientation. This method has potential as a means for rapid fingerprinting of S. aureus. Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 28 Jul 2017 16:57:08 41 G. A. Robertson and others Identification of high-D SNP sets in other organisms and comparison of the relationship between the number of SNPs and D-values obtained Fig. 1. Effect on allele discrimination for the N. meningitidis fumC435 SNP of a subterminal mismatch 7 bp from the 39 end of the allelespecific primers. Our approach is applicable to all species for which there are MLST databases. Therefore, MLST databases for a number of different species were searched for high-D SNP sets. Species subject to this analysis were Burkholderia pseudomallei/ Burkholderia mallei, Helicobacter pylori, Campylobacter jejuni, Streptococcus pneumoniae, Streptococcus pyogenes and Enterococcus faecium (Feil et al., 2000; Enright et al., 2001; Sa-Leao et al., 2001; Dingle et al., 2002; Homan et al., 2002; Godoy et al., 2003; Meats et al., 2003). Ten different sets of SNPs were identified for each species. For each species, the relationship between D and the number of SNPs was essentially identical for each SNP set, but there were large differences between species (Fig. 2). Despite the differences, it was possible to identify high-D SNP sets in all cases. It was concluded that SNP-based typing could potentially be applied to any bacterial species for which there is sufficient comparative sequence data. The basis for differences in the relationship between the number of SNPs and D-values obtained is an intriguing Table 6. Kinetic PCR on S. aureus genomic DNA for high-D SNPs All ˜CT values obtained were in the expected orientation. NA, Not applicable; NT, not tested. arcC210 tpi243 arcC162 ST SNP profiles: ST-30 T G A ST-88 T A T ST-239 T A A ST-1 C A T Mean SD ˜CT values from pooled results for each polymorph: G NA 11.58 1.95 NA A NT 7.33 2.67 16.46 1.2 T 12.51 1.77 NA 15.46 0.85 C 11.54 1.34 NA NA tpi241 yqiL333 aroE132 gmk129 G G G G T C T C G A A A T T C C 5.6 0.52 NA NA NT NA 12.99 3.88 6.72 2.7 NA 6.63 1.35 9.57 2.66 NA 9.32 2.78 8.4 2.56 NA NA NA Table 7. ˜CT values from interrogation of high-D SNPs in ST-30 DNA ST-30 ˜CT Matched ( SD) Mismatched ( SD) 42 arcC210 tpi243 arcC162 tpi241 yqiL333 aroE132 gmk129 20.19 5.12 31.87 2.93 20.19 3.15 30.24 3.67 18.28 2.00 33.38 4.48 20.32 2.52 25.54 3.26 19.52 4.05 24.59 3.53 19.74 3.0 29.01 1.31 20.40 2.75 31.92 6.88 Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 28 Jul 2017 16:57:08 Journal of Medical Microbiology 53 Bacterial typing based on computer-identified SNP sets Therefore, this approach is suitable for bacterial pathogens, as bacteria exhibit significant interspecies variation in their diversity and clonality and, hence, also vary in the extent of linkage disequilibrium between markers that are separated by any given distance. The results presented here demonstrate that SNP combinations with quantitatively defined and useful resolving powers can indeed be identified and can also be interrogated by using efficient and convenient assays. Fig. 2. Relationships between number of SNPs and D-values obtained for MLST databases for a number of different bacterial species. STs analysed were: Helicobacter pylori, 1–370; Neisseria meningitidis, 1–2859; Streptococcus pneumoniae, 1–900; Campylobacter jejuni, 1–812; Streptococcus pyogenes, 1–108; Haemophilus influenzae, 1–95; Burkholderia pseudomallei/Burkholderia mallei, 1– 101; Enterococcus faecium, 1–63; Staphylococcus aureus, the 202 STs available for download in September 2003. In the case of H. pylori, YphC alleles 286, 288 and 310–315 contained 6 bp of missing sequence; these were filled in manually by using the consensus sequence for this region. puzzle. It is tempting to look for a relationship between clonality and D-values: a highly clonal organism would have linkage disequilibrium that extended over the entire genome, so different SNPs would be polymorphic only within certain lineages, resulting in lower combinatorial power between SNPs. The recent finding that S. aureus has a low recombination rate (Feil et al., 2003) may explain the low D-values that were obtained with this species. However, there are other factors. The extent of genetic diversity would be expected to have an effect, simply because more genetic diversity means a larger pool of SNPs and, all things being equal, a consequently higher probability of finding highly informative sets of SNPs. Also, the presence of trimorphic or tetramorphic SNPs potentially speeds up the accumulation of resolving power as each SNP is added to the set; such SNPs are much more common in more diverged species. Our current view is that both the frequency of recombination and the number and nature of SNPs available contribute to D-values that can be obtained for a given number of SNPs. Possible applications The purpose of this study was to develop a systematic and practical approach to making use of known genetic diversity in micro-organisms, in order to assemble rapid and costeffective genetic fingerprinting methodologies. The algorithms used are entirely empirical and there are no assumptions made concerning population structure or diversity. http://jmm.sgmjournals.org Possible criticisms of our work are that SNP combinations will almost never have the same resolution as the full ST, that SNP combinations are ineffective at detecting new STs and that sequencing technology is itself rapid and cost-effective. The major counter to this is that MLST is itself somewhat arbitrary in nature. There is no certainty that two isolates with the same ST do not differ in regions that are subject to high-frequency transposition or site-specific recombination events, or even in housekeeping genes that are outside the MLST-defined fragments and have undergone recombination or mutation. MLST appears to be an excellent method for identifying clonal background and carrying out longrange studies of population structure and evolution. Higher typing resolution, such as that required for reliable outbreak detection, may require methods such as PFGE. This raises the question as to the practical value of the increase in resolution provided by MLST, as compared to our SNP-based approach. It is likely that highly effective typing methodologies of the future will encompass interrogation of housekeeping genes to determine clonal background, and interrogation of hypervariable regions to increase resolution. A SNP-based approach in combination with interrogation of hypervariable regions would be expected to have a very similar performance to full ST determination in combination with interrogation of hypervariable regions. Although the ability of SNP-based typing to detect new STs is limited, in the case of high-D SNP sets, it is not zero, as novel SNP profiles can be detected. In the case of specified variant SNPs, the concern is not relevant, as these SNPs are designed only to allow efficient determination of whether or not an isolate has an ST of interest. Also, SNP-based typing does not rule out the use of MLST or other higher-resolution typing methods. Specified variant SNPs for ST-x will never provide a false-negative answer to the question ‘Does this isolate have ST-x?’, whereas high-D SNPs will never provide a falsenegative answer to the question ‘Are these isolates identical?’. When these questions are answered in the affirmative on the basis of SNP profiles, it may be appropriate in some instances to carry out full ST determination or other higher-resolution typing procedures. We therefore see SNP-based typing to be most suitable as a routinely applied, high-throughput screening tool that can flag potential outbreaks or the appearance of variants of concern. Possible applications include infection control in healthcare facilities, monitoring of food-borne disease and biodefence. It was in recognition of this that we tested the practicality of kinetic PCR as an SNP-interrogation method. This approach is single-step –there is no prior amplification required. It Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 28 Jul 2017 16:57:08 43 G. A. Robertson and others uses unlabelled primers and no probes. There is a trend toward lower reaction volumes and reduced reagent costs in real-time PCR applications, and kinetic PCR has the potential to be extremely cost-effective. We envisage this method being carried out on colonies on primary isolation plates or on blood or enrichment cultures from normally sterile sites. It could be combined easily with other gene-detection assays for, for example, virulence factor-encoding genes or antibiotic-resistance determinants, to gain a complete picture of a micro-organism in a very short time. study was supported by the Australian Federal Government Cooperative Research Centres Programme and the Queensland State Government Department of Innovation and Information Economy. In our hands, kinetic PCR is a tractable methodology. It is well-known that different 39 mismatches are subject to different mis-priming frequencies (Huang et al., 1992; Ayyadevara et al., 2000), so the different mean ˜CT values obtained at different SNPs were not unexpected. Remarkably, kinetic PCR reactions yielded approximately equal ˜CT values (although always in the expected orientation) with different polymorphs at any given SNP, indicating that the allele-specific primers in any given primer pair (or triplet) mis-primed at approximately equal frequencies. This was not expected and quite fortuitous. Primer design for N. meningitidis reactions was more challenging than for S. aureus reactions, as greater diversity in N. meningitidis means that diversity within primer sites is a factor that must be accommodated. However, we were only unsuccessful in assembling a reliable kinetic PCR assay for one SNP (pdhC456) and degenerate primers were only necessary for one other SNP (aroE283). In the case of fumC435, primer site diversity proved to be advantageous, as a mismatch 7 bp from the 39 end increased the ˜CT . Deliberate inclusion of mismatches is likely to prove a valuable tactic for maximizing ˜CT values, as has been found previously for conventional allele-specific PCR (Ishikawa et al., 1995). It can be seen, in Table 2, that the N. meningitidis allele-specific primers differ markedly at locations other than the 39 end. This was necessary to accommodate diversity in the primer sites and appears to be a viable option, as examination of sequence alignments indicated that the diversity was not distributed evenly –particular bases at polymorphic sites were linked strongly to certain sequences within the primer-binding sites. This may be the effect of insertions, deletions or inversions in the evolutionary past. With the much less diverse species S. aureus, diversity within the primer-binding sites was not an issue. Caugant, D. A., Bol, P., Høiby, E. A., Zanen, H. C. & Frøholm, L. O. (1990). Clones of serogroup B Neisseria meningitidis causing systemic In conclusion, we have demonstrated a novel and widely applicable approach to bacterial typing. It makes use of known genetic diversity and allows the typing method to be designed with reference to the resolving power required. It would be worthwhile to test this approach against large numbers of isolates. REFERENCES Ayyadevara, S., Thaden, J. J. & Shmookler Reis, R. J. (2000). Discrimination of primer 39-nucleotide mismatch by Taq DNA polymerase during polymerase chain reaction. Anal Biochem 284, 11–18. Booch, G. (1993). Object-oriented Analysis and Design with Applications, 2nd edn. Boston: Addison-Wesley. disease in The Netherlands, 1958–1986. J Infect Dis 162, 867–874. Dingle, K. E., Colles, F. M., Ure, R., Wagenaar, J. A., Duim, B., Bolton, F. J., Fox, A. J., Wareing, D. R. A. & Maiden, M. C. J. (2002). Molecular characterization of Campylobacter jejuni clones: a basis for epidemiologic investigation. Emerg Infect Dis 8, 949–955. Enright, M. C., Spratt, B. G., Kalia, A., Cross, J. H. & Bessen, D. E. (2001). Multilocus sequence typing of Streptococcus pyogenes and the relationships between emm type and clone. Infect Immun 69, 2416–2427. Enright, M. C., Robinson, D. A., Randle, G., Feil, E. J., Grundmann, H. & Spratt, B. G. (2002). The evolutionary history of methicillin-resistant Staphylococcus aureus (MRSA). Proc Natl Acad Sci U S A 99, 7687–7692. Feil, E. J., Smith, J. M., Enright, M. C. & Spratt, B. G. (2000). Estimating recombinational parameters in Streptococcus pneumoniae from multilocus sequence typing data. Genetics 154, 1439–1450. Feil, E. J., Holmes, E. C., Bessen, D. E. & 9 other authors (2001). Recombination within natural populations of pathogenic bacteria: short-term empirical estimates and long-term phylogenetic consequences. Proc Natl Acad Sci U S A 98, 182–187. Feil, E. J., Cooper, J. E., Grundmann, H. & 9 other authors (2003). How clonal is Staphylococcus aureus? J Bacteriol 185, 3307–3316. Germer, S., Holland, M. J. & Higuchi, R. (2000). High-throughput SNP allele-frequency determination in pooled DNA samples by kinetic PCR. Genome Res 10, 258–266. Giffard, P. M., McMahon, J. A., Gustafson, H. M., Barnard, R. T. & Voisey, J. (2001). Comparison of competitively primed and conventional allele- specific nucleic acid amplification. Anal Biochem 292, 207–215. Godoy, D., Randle, G., Simpson, A. J., Aanensen, D. M., Pitt, T. L., Kinoshita, R. & Spratt, B. G. (2003). Multilocus sequence typing and evolutionary relationships among the causative agents of melioidosis and glanders, Burkholderia pseudomallei and Burkholderia mallei. J Clin Microbiol 41, 2068–2079. Homan, W. L., Tribe, D., Poznanski, S., Li, M., Hogg, G., Spalburg, E., van Embden, J. D. A. & Willems, R. J. L. (2002). Multilocus sequence typing scheme for Enterococcus faecium. J Clin Microbiol 40, 1963–1971. Huang, M. M., Arnheim, N. & Goodman, M. F. (1992). Extension of base mispairs by Taq DNA polymerase: implications for single nucleotide discrimination in PCR. Nucleic Acids Res 20, 4567–4573. Hunter, P. R. & Gaston, M. A. (1988). Numerical index of the discriminatory ability of typing systems: an application of Simpson’s index of diversity. J Clin Microbiol 26, 2465–2466. Ishikawa, Y., Tokunaga, K., Kashiwase, K., Akaza, T. & Juji, T. (1995). Sequence-based typing of HLA-A2 alleles using a primer with an extra base mismatch. Hum Immunol 42, 315–318. ACKNOWLEDGEMENTS The authors thank Graeme Nimmo, John Bates, Helen Smith and the Australian Group on Antimicrobial Resistance for providing samples and helpful discussions, David Hammond for assistance with kinetic PCR and Alex Stephens for assistance with MLST determinations. This 44 Jeanmougin, F., Thompson, J. D., Gouy, M., Higgins, D. G. & Gibson, T. J. (1998). Multiple sequence alignment with Clustal X. Trends Biochem Sci 23, 403–405. Krawczak, M. (1999). Informativity assessment for biallelic single nucleotide polymorphisms. Electrophoresis 20, 1676–1681. Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 28 Jul 2017 16:57:08 Journal of Medical Microbiology 53 Bacterial typing based on computer-identified SNP sets Maiden, M. C. J., Bygraves, J. A., Feil, E. & 10 other authors (1998). Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 95, 3140–3145. Meats, E., Feil, E. J., Stringer, S., Cody, A. J., Goldstein, R., Kroll, J. S., Popovic, T. & Spratt, B. G. (2003). Characterization of encapsulated and sequence typing of Streptococcus pneumoniae clones with unusual drug resistance patterns: genetic backgrounds and relatedness to other epidemic clones. J Infect Dis 184, 1206–1210. Shlush, L. I., Behar, D. M., Zelazny, A., Keller, N., Lupski, J. R., Beaudet, A. L. & Bercovich, D. (2002). Molecular epidemiological analysis of the changing nature of a meningococcal outbreak following a vaccination campaign. J Clin Microbiol 40, 3565–3571. noncapsulated Haemophilus influenzae and determination of phylogenetic relationships by multilocus sequence typing. J Clin Microbiol 41, 1623–1636. Simpson, E. H. (1949). Measurement of diversity. Nature 163, 688. Sachidanandam, R., Weissman, D., Schmidt, S. C. & 38 other authors (2001). A map of human genome sequence variation containing 1.42 Tribe, D. E., Zaia, A. M., Griffith, J. M., Robinson, P. M., Li, H. Y., Taylor, K. N. & Hogg, G. G. (2002). Increase in meningococcal disease associated million single nucleotide polymorphisms (International SNP Map Working Group). Nature 409, 928–933. Sa-Leao, R., Tomasz, A. & de Lencastre, H. (2001). Multilocus http://jmm.sgmjournals.org with the emergence of a novel ST-11 variant of serogroup C Neisseria meningitidis in Victoria, Australia, 1999–2000. Epidemiol Infect 128, 7–14. Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 28 Jul 2017 16:57:08 45
© Copyright 2026 Paperzz