Syst. Biol. 53(6):914–932, 2004
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150490888840

Evolution of a RNA Polymerase Gene Family in Silene (Caryophyllaceae)—Incomplete Concerted Evolution and Topological Congruence Among Paralogues

MAGNUS POPP AND BENGT OXELMAN
Department of Systematic Botany, Evolutionary Biology Centre, Uppsala University, Norbyvägen 18D, SE-752 36 Uppsala, Sweden; E-mail: [email protected] (M.P.)

Abstract.—Four low-copy nuclear DNA intron regions from the second largest subunits of the RNA polymerase gene family (RPA2, RPB2, RPD2a, and RPD2b), the internal transcribed spacers (ITSs) from the nuclear ribosomal regions, and the rps16 intron from the chloroplast were sequenced and used in a phylogenetic analysis of 29 species from the tribe Sileneae (Caryophyllaceae). We used a low stringency nested polymerase chain reaction (PCR) approach to overcome the difficulties of constructing specific primers for amplification of the low copy nuclear DNA regions. Maximum parsimony analyses resulted in largely congruent phylogenetic trees for all regions. We tested overall model congruence in a likelihood context using the software PLATO and found that ITSs, RPA2, and RPB2 deviated from the maximum likelihood model for the combined data. The topology parameter was then isolated and topological congruence assessed by nonparametric bootstrapping. No strong topological incongruence was found. The analysis of the combined data sets resolves previously poorly known major relationships within Sileneae. Two paralogues of RPD2 were found, and several independent losses and incomplete concerted evolution were inferred. The among-site rate variation was significantly lower in the RNA polymerase introns than in the rps16 intron and ITSs, a property that is attractive in phylogenetic analyses.
[Caryophyllaceae; concerted evolution; congruence test; low-copy nuclear genes; paralogy; RNA polymerase; Sileneae.] Nuclear ribosomal DNA (nrDNA) and chloroplast DNA (cpDNA) are the most widely used DNA regions to infer phylogenetic relationships among plants. And for good reasons too; both nrDNA and cpDNA are abundant in plant cells, they are usually easy to amplify and sequence using standard protocols, and the phylogenetic interpretations are generally reasonably uncomplicated. However, there are some limitations with nrDNA and cpDNA, for example incomplete homogenization of the tandem repeats in nrDNA (Buckler et al., 1997), or loss of one or more of the homoeologs in allopolyploids (Wendel et al., 1995). Due to the predominant maternal inheritance of chloroplasts, the evolutionary history of cpDNA may differ substantially from that of the main part of the nuclear genome (Cronn et al., 2002; Ferguson and Jansen, 2002), and the transfer of DNA regions from the plastid to the nucleus (Martin and Herrmann, 1998; Rujan and Martin, 2001) may further complicate the picture. Although these characteristics of nrDNA and cpDNA sometimes may cause severe difficulties when inferring organismal phylogenies, in many cases they do not. The major problem when working at shallower taxonomical levels is often that there is just not enough information to draw unambiguous conclusions from. The ambiguity may either stem from the fact that there is not enough information from nucleotide substitutions and/or the cpDNA/nrDNA trees are discordant to or incomplete with respect to the organismal history (see Sang, 2002). 
To deal with these problems, there has been a surge of “new” low-copy nuclear DNA regions (lcnDNA) used in plant phylogenetics in the last few years, for example Adh (Yokoyama and Harry, 1993; Kosuge et al., 1995), GPAT (Tank and Sang, 2001), waxy (Mason-Gamer et al., 1998), Betv1 (Wen et al., 1997), PgiC (Gottlieb and Ford, 1996), and RPB2 (Denton et al., 1998; Oxelman and Bremer, 2000; Popp and Oxelman, 2001). Cronn et al. (2002) used 11 single-copy nuclear loci to study the evolution of the major lineages in Gossypium (Malvaceae). The incorporation of lcnDNA regions in phylogenetics has not been without obstacles. The evolution of lcnDNA regions is largely unknown but seems to be rather dynamic, with fluctuating copy numbers, differences in chromosomal locations, and recombination events (e.g., Gottlieb and Ford, 1996; Clegg et al., 1997; Martin and Burg, 2002). This contributes to the difficulties of determining orthology of the sequences used and emphasizes the importance of as complete a sampling of sequences as possible. The paralogy may stem from, for example, duplications within the same organism, or lateral transfer between organisms (Wendel and Doyle, 1998). In plants, allopolyploidy is an important evolutionary process, in which the entire genomes of the parental lineages are merged. It is possible to differentiate between paralogy due to gene duplication and paralogy due to allopolyploidy by using several unlinked regions simultaneously. However, few, if any, protocols for lcnDNA regions can be applied to a given plant group without optimization and redesign of polymerase chain reaction (PCR) primers and of PCR parameters. Sang (2002) considered it unlikely that there will be universal primers for the majority of low-copy nuclear genes. Instead, primers specific for less inclusive groups (families, genera) will have to be developed.

The RNA polymerase (RNAP) family consists of three large nuclear DNA-dependent RNA polymerase holoenzymes in most eukaryotes.
RNAPs I and III transcribe structural RNA such as rRNA and tRNA, respectively, whereas RNAP II mainly transcribes mRNA. However, a fourth member, RNAP IV, is found only in plants and its function is yet to be elucidated (The Arabidopsis Genome, 2000). In Arabidopsis thaliana, three of the genes (RPA2, RPB2, and RPC2), encoding the second largest subunits of these holoenzymes, are single-copy and are located on chromosomes 1, 4, and 5, respectively, whereas the fourth (RPD2) is present in two, presumably recently diverged, paralogues located on chromosome 3 (The Arabidopsis Genome, 2000).

The phylogeny of the tribe Sileneae (Caryophyllaceae) has recently been investigated using nrDNA (the ITS regions) and cpDNA (the rps16 intron) data (Oxelman and Lidén, 1995; Desfeux and Lejeune, 1996; Oxelman et al., 1997, 2001). We chose 29 species from Sileneae to study the evolution and the phylogenetic utility of the RNAP introns and to compare the results with the analyses of the ITS and rps16 data. In this paper we aim to (1) test the phylogenetic hypothesis based on ITS and rps16 data in Sileneae (Oxelman et al., 2001); (2) provide future studies of Sileneae with backbone information from several, presumably unlinked regions, thus facilitating inferences of gene duplications and allopolyploidizations; and (3) investigate the topological congruence among trees inferred from the data sets. To accomplish this we develop a general method for rapid design of primers targeting all members of a gene family; in this case, a region coding for the second largest subunits of the RNA polymerase family. We aim at a complete sampling, i.e., all orthologous and paralogous sequences, of intron regions between two highly conserved amino acid motifs (GDK and GEMERD; Fig. 1) in the genes encoding the second largest subunits (RPA2, RPB2, RPC2, and RPD2, respectively) in the RNAP family.
MATERIAL AND METHODS

Plant Materials
Plant materials used in this study are presented with voucher data and GenBank/EMBL accession numbers in Table 1. Total genomic DNA was extracted as described in Oxelman et al. (1997), or in a few cases using either DNeasy Plant Mini Kit (QiaGen) or Plant DNA Isolation Kit (Boehringer Mannheim) according to the manufacturer’s manual.

PCR and Sequencing
Typically, 0.625 U Taq polymerase from Advanced Biotechnologies was used in 25 µL PCR reactions, with 1.5 to 2.5 mM Mg2+, 200 µM of each dNTP, 0.5 to 1.0 µM of each primer, 0.01% bovine serum albumin (Boehringer Mannheim), and 0.5 to 1.0 µL total genomic DNA of unknown concentration. The rps16 and ITS regions were amplified using PCR cycling with an initial 5 min denaturation at 95 °C, followed by 35 cycles of 30 s denaturation at 95 °C, 1 min annealing at 56 to 58 °C, and 2 min extension at 72 °C, and ended with a 10 min final extension at 72 °C. Primers rpsF/rpsR2R (rps16; Oxelman et al., 1997) and P17/26S82R (ITS; Popp and Oxelman, 2001) were used for PCR, and rpsF2a/rpsR3R (rps16; Popp et al., in press) and P16b/ITS4R (ITS; Popp et al., in press, and White et al., 1990, respectively) were used for sequencing.

To obtain the first RNAP intron sequences we used a low-stringency nested PCR approach (Fig. 2). The first PCR was performed with all combinations of the four RNAP-specific primers kindly provided by B. D. Hall, Washington University, Seattle, USA (Table 2) to amplify the region between the highly conserved amino acid motifs GDK and GEMERD of RPA2, RPB2, RPC2, and RPD2 simultaneously (Fig. 1). After an initial 5 min denaturation at 95 °C, the cycling started with 30 s at 95 °C. Annealing began with 3 s at 50 °C followed by a 0.3 °C increase/s up to 72 °C. This was followed by 72 °C for 2 to 5 min (+1 s/cycle). The cycling was repeated 34 times before a 15 min extension at 72 °C and a subsequent soak at 4 °C.
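The ramped annealing step of the low-stringency program fixes how long each cycle spends between 50 °C and 72 °C, and this can be worked out directly. The Python sketch below computes that ramp time and a rough total run time. The function names are hypothetical, and several values are assumptions made only for illustration: 35 cycles in total (one cycle plus 34 repeats), a 120 s base extension (the lower end of the stated 2 to 5 min), the +1 s increment starting with the second cycle, and instantaneous temperature changes everywhere except the stated ramp.

```python
def ramp_seconds(start_c: float, end_c: float, rate_c_per_s: float) -> float:
    """Time needed to ramp linearly from start_c to end_c at the given rate."""
    return (end_c - start_c) / rate_c_per_s


def program_seconds(n_cycles: int = 35, base_extension_s: float = 120.0) -> float:
    """Approximate duration of the low-stringency program, excluding the 4 °C soak.

    Assumptions (not stated in the text): n_cycles counts the first cycle plus
    its repeats, and the extension grows by 1 s per cycle starting with the
    second cycle.
    """
    initial_denaturation_s = 5 * 60
    final_extension_s = 15 * 60
    ramp_s = ramp_seconds(50.0, 72.0, 0.3)

    total = initial_denaturation_s + final_extension_s
    for cycle in range(n_cycles):
        denaturation_s = 30
        annealing_hold_s = 3
        extension_s = base_extension_s + cycle  # +1 s per cycle
        total += denaturation_s + annealing_hold_s + ramp_s + extension_s
    return total


if __name__ == "__main__":
    print(f"ramp per cycle: {ramp_seconds(50.0, 72.0, 0.3):.1f} s")
    print(f"approximate run time: {program_seconds() / 60:.0f} min")
```

Changing `base_extension_s` to 300 gives the corresponding estimate for the upper end of the stated extension range.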
The result was a heterogeneous pool of PCR products, presumably including all the sought-after RNAP introns. The second PCR (Fig. 2) used the same PCR program, degenerate though subunit-specific primers provided by B. D. Hall (Table 2), and the pooled PCR products from the previous four reactions as template. In this PCR round, the subunit introns were separated. The resulting fragments were cloned using TOPO TA Cloning Kit (Invitrogen) according to the manufacturer’s manual, with the exception that only half of the volumes recommended for the reactions were used. Between 15 and 40 positive colonies from each reaction were screened by direct PCR (Fig. 2) using T7 and M13R universal primers. In general, 5 to 15 PCR products from individual colonies were purified using either QIAquick PCR-purification Kit (QiaGen) or Multiscreen PCR (Millipore) according to the manufacturer’s manual, and subsequently sequenced using at least one of the T7 and M13R primers. Often, the quality of the obtained sequences was very good, and the reverse sequencing reaction was performed only when there were ambiguities. Sequencing was done with either ABI PRISM BigDye Terminator Cycle Sequencing Kit and visualized on an ABI PRISM 377 Sequencer (Perkin-Elmer), or DYEnamic ET Terminator Cycle Sequencing Premix Kit (Amersham Pharmacia Biotech) on a MegaBACE 1000 DNA Analysis System (Amersham Pharmacia Biotech). All sequences were edited using Sequencher 3.1.1 (Gene Codes Corporation). Unique substitutions in clones from one accession were ignored and consensus sequences were constructed to reduce the effects of putative Taq errors. If a substitution was found at the same position in two or more clones from the same taxon, it was considered to represent a unique sequence. The number of clones sequenced for each unique sequence is indicated on the phylogenetic trees (Figs. 3 to 8).

A second and third set of subunit-specific primers, this time based on the initially obtained sequences and thus potentially Sileneae specific, were designed and used for conventional PCR and direct sequencing. The PCR conditions for the “Sileneae specific” primers were as follows: initial denaturation at 95 °C for 5 min followed by 35 cycles of 95 °C for 30 s, 56 to 58 °C for 1 min, and 72 °C for 2 min. The PCR ended with 10 min at 72 °C and a subsequent soak at 4 °C. Whenever there was an indication that the PCR product was not unique, either from multiple bands visualized on the agarose gel or from double signals on chromatograms from direct sequencing reactions, the PCR product was cloned. Preliminary analyses indicated two copies of RPD2, and a set of paralogue-specific forward primers was constructed (Table 2) and used for both PCR and sequencing. The cloning procedure used in this study has the disadvantage of PCR-mediated recombinants being sequenced.

FIGURE 1. Structure of second largest subunits of the RNAP gene family in Arabidopsis thaliana. Accession numbers RPA2 to RPD2b: AC008030, AL035527, AB012240, AB020749, and AP000377, respectively. Boxes represent exons and lines represent introns. Lengths are proportional to scale bar. Arrows indicate the highly conserved amino acid regions GDK and GEMERD, and also approximate primer sites for RNAP10F, RNAP10FF, RNAP11R, and RNAP11bR. Note that the two paralogous RPD2 sequences in A. thaliana are not orthologous to the two paralogues in Sileneae.

FIGURE 2. Outline of the nested PCR procedure.

TABLE 1. Plant material and GenBank accession numbers.

Vouchera,b 2 rps16c 1 ITSc 1 RPA2c 2 2 RPC2c D—082 D—081 D—083 F—140 D—084–85 F—1321 D—086–87 D—0882 F—141 D—090 n.a. n.a. n.a. n.a. n.a. n.a. n.a. n.a. n.a. n.a. D—063–64 D—154–56 D—065 n.a. D—066 n.a. n.a. n.a. D—067, 91 n.a. n.a. n.a. D—0892 n.a. D—068 n.a. D—0692 D—1572 D—070-71 n.a. D—072 n.a. D—073 n.a. D—0742 n.a. F—139 n.a. D—076 n.a. D—0772 n.a. D—078 n.a. D—079–80 D—158 D—0752 n.a. RPB2c 2 D—102-04 D—100–01 D—120 D—098–99 D—107, 10–12, 18 D—105–063 D—147–48 D—130–312 D—108-09 D—119 D—1382 D—139–40 D—150–53 n.a. D—125 D—116–17 D—135–362 D—141–42 D—092, 442 D—113–15 D—122–24 D—093–94 D—126–272 D—121, 28–29 D—132–33 D—145–462 D—134, 51–52 D—095–97 D—137, 43, 492 RPD2a/bc

Taxon/chromosome number 1 Agrostemma githago, 2n = 24 BO ITS-AGR 30616 (GB) , MP 1049 (UPS) A—54 B—95 C—279–80 Atocion armeria, 2n = 24 BO ITS-ARM30611(GB) A—59 B—80 n.a. Atocion lerchenfeldiana, 2n = 24 AS 24188 (C) E—061 E—057 C—281 Atocion rupestre, 2n = 24 BO 2198 (GB) A—60 B—74 C—282 Eudianthe coeli-rosa, 2n = 24 BO 2285 (GB) A—56 B—81 C—283 Eudianthe laeta, 2n = 24 BO 1876 (GB) A—55 B—82 C—284 1 2 1 1 Lychnis abyssinica, 2n = ? GF 8418 (UPS) , OH 5530 (UPS) A—61 B—90 C—2852 Lychnis chalcedonica, 2n = 24 BO 2277 (GB) A—64 B—94 C—286 A—651 B—911 C—2872 Lychnis coronaria, 2n = 24 BO 2278 (GB)1 , MP 1050 (UPS)2 Lychnis flos-cuculi, 2n = 24 BO 2200 (GB) A—63 B—93 C—288 Lychnis flos-jovis, 2n = 24 BO ITS-FLO30610 (GB) A—66 B—92 n.a. Petrocoptis pyrenaica, 2n = 24 BO 2276 (GB) A—67 B—75 C—289 A—891 B—601 C—2902 Silene acaulis, 2n = 24 BO 2243 (GB)1 , MP 1046 (UPS)2 Silene baccifera, 2n = 24 BO 2287 (GB) A—69 B—89 C—291 Silene bergiana, 2n = 24 H 1182 (GB) A—91 B—35 C—292 Silene conica, 2n = 20 BO 1944 (GB)1 , BO 1898 (GB)2 A—701 B—321 C—2932 Silene fruticosa, 2n = 24 OT 934 (GB) A—88 B—65 C—294 Silene keiskei, 2n = 24, 48 BO 2345 (UPS) C—913 C—909 C—295, 68–69 1 2 1 1 E—060 E—058 C—296–972 Silene linnaeana, 2n = 24 G 143 (MV) , M 1975.VI.28 (K) Silene ajanensis, 2n = ? Silene nivalis, 2n = 24 BO 2255 (GB) A—90 B—61 C—299 Silene nigrescens, 2n = ? KGB217 (GB) C—915 B—58 C—298 Silene nocturna, 2n = 24 OT 654 (GB) A—92 B—41 C—300 Silene noctiflora, 2n = 24 BO 2229 (GB) A—76 B—29 n.a. Silene parishii, 2n = 48 ME 886 (WTU) C—914 C—910 C—301–02 C—3031 Silene pentelica, 2n = 24 MP 1008 (UPS)1 , MP 1009 (UPS)2 , C 2046 (GB)3 AJ2949661 AJ2998122 Silene rotundifolia, 2n = 48 BO 2231 (GB) A—83 B—87 C—304 A—941 B—521 C—3052 Silene schafta, 2n = 24 BO 2264 (GB)1 , MP 1053 (UPS)2 Silene zawadskii, 2n = 24 BO 2241 (GB) A—77 B—83 C—306 Viscaria vulgaris, 2n = 24 MP 1051 (UPS) C—912 C—911 C—307

a Superscript numbers indicate corresponding specimen if sequences are produced from more than one voucher.
b BO = B. Oxelman; MP = M. Popp; AS = A. Strid et al.; GF = Gilbert and Fries; OH = O. Hedberg; H = Holmdahl; OT = Oxelman and Tollsten; G = Gubanov; M = Mikhajlova [?]; ME = M. Egger; C = Christodoulakis.
c A— = Z831; B— = X868; C— = AJ629; D— = AJ634; E— = AJ409; F— = AJ296.

TABLE 2. Primers used for PCR and sequencing.

RNAP-specific forward (F) and reverse (R) primers used in first PCR
  RNAP10F    TTYTCIAGYATGCAYGGICARAARGG
  RNAP10FF   GGNGAYAARTTYDSNWSNMKRCAYGGNCAR
  RNAP11R    ARRCARTCNCKYTCCATYTCNCC
  RNAP11bR   GGWGARATGGARCGWGATTG

Subunit-specific (A, B, C, and D, respectively) forward (F) and reverse (R) primers used in second PCR
  A2F        GTTTGYTCTCARTTRTGGCCWG
  A2Ra       GRACCATGTGWCGCAGHCKYTGRTA
  B2F        TGGWCNRYBGARGGSATHAC
  B2R        NCCRCGCAYTGRTANCCRCA
  C2F        GAATCCWCATGGKTTYCCAAGG
  C2R        CAACYTTRTCAGCATKACCAC
  D2F        CCHGGNCARYTBYTDGARGCTGCYYT
  D2R        YRCCNGTYCKDCCRTYGTADAC

Sileneae-specific forward (F) and reverse (R) PCR primers (P)
  RPA2FP     GCCGTTTTCWGAGATAACTGGGATGCGT
  RPA2RP     GRTAATAAACAGGYCCAATAAAGATCTC
  F7327ᵃ     CCATCYCGTATGACAATCGGYCAGCTT
  R7586ᵃ     CCCMGTGTGACCATTGTACATTGTCT
  RPD2FP     GCATGTGGTGGYACDTTGAGATATGCT
  RPD2RP     CTTTCAYTYCCCCATCGACAGAATCCAG

Sileneae-specific forward (F) and reverse (R) sequencing primers (S)
  RPA2FS     CATGCRTTTCCTTCTAGRATGAC
  RPA2RS     GTTAAMTCGGTRCCATAAACTC
  F7381ᵃ     AGCGTCTCCTTCCTTACCCACATGAGC
  R7555ᵃ     CCACGCATCTGATACCCACATTTCTG
  RPD2FS     CTGTTGAATCSATTACRGAGCAACTTC
  RPD2RS     CAGAATCCAGCCCTGCAATC
  RPD2FSa    GGTATCCCATTTAMGACTTNKTCTTTTG
  RPD2FSb    GGTATCCCATTWAAGACTTRAAGGAAA

ᵃ From Popp and Oxelman (2001).
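The degenerate primers listed in Table 2 are written with IUPAC ambiguity codes, so each primer name stands for a pool of distinct oligonucleotides. As a rough illustration of how degenerate the pools are, the Python sketch below multiplies the number of alternatives at each position. The function name is hypothetical, and treating inosine (I) as a single base that adds no degeneracy is an assumption made here, not something stated in the text.

```python
from math import prod

# Number of bases each IUPAC nucleotide code can stand for.
IUPAC_CHOICES = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
    "I": 1,  # assumption: inosine is one physical base, so it adds no degeneracy
}


def degeneracy(primer: str) -> int:
    """Number of distinct sequences represented by a degenerate primer."""
    return prod(IUPAC_CHOICES[base] for base in primer.upper())


if __name__ == "__main__":
    # Primer sequences copied from Table 2.
    primers = {
        "RNAP10F": "TTYTCIAGYATGCAYGGICARAARGG",
        "RNAP11bR": "GGWGARATGGARCGWGATTG",
        "C2R": "CAACYTTRTCAGCATKACCAC",
    }
    for name, sequence in primers.items():
        print(name, degeneracy(sequence))
```

A primer with several N, B, D, or H positions represents a far larger pool than one with a few two-fold codes, which is one reason degenerate primers tend to amplify less specifically.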
To detect recombinants, the criteria of Popp and Oxelman (2001) were employed.

Alignment and Gap Coding
The sequences were manually aligned using Se-Al Ver. 1.0a1 (A. Rambaut, http://evolve.zoo.ox.ac.uk). Gaps (inferred insertions/deletions) were introduced in the sequences to keep the number of substitutions in an aligned region to a minimum. Equal costs were assumed for gap opening and extension versus substitutions, but lower costs to substitutions in case of ties. Two or more gaps of equal length inferred at the same position were assumed to be a homologous character and were binary coded as present/absent. Large autapomorphic insertions were excluded from the analyses. These insertions varied between 50 bp (Silene bergiana, rps16) and 710 bp (Lychnis coronaria, RPA2).

Phylogenetic Analyses
Maximum parsimony analyses (MP in the following text) of all six data sets were performed separately with PAUP* version 4.0b10 for Macintosh (PPC), or UNIX (Swofford, 2002), using heuristic search, random addition sequence with 100 replicates, tree bisection-reconnection (TBR) branch swapping, MULTREES option on, and DELTRAN optimization (ACCTRAN optimization may cause erroneous branch lengths in branch lengths tables and when printing trees due to a bug in PAUP* version 4.0b10). Maximum parsimony bootstrap analyses (MPB in the following text) were carried out with full heuristics, 1000 replicates, TBR branch swapping, MULTREES option off, and random addition sequence with four replicates. Maximum likelihood estimates of all parameters, including branch lengths, were determined on one arbitrarily chosen most parsimonious tree for each data set.

Bayesian posterior probabilities (PPs) for the nodes in the phylogenetic trees were estimated using MrBayes version 3.0B4 (Huelsenbeck and Ronquist, 2001). Each data set was analyzed with the default prior distributions and an optimal model of evolution determined by MrModeltest version 1.1b (Nylander, 2003). MrModeltest is a modified version of Modeltest (Posada and Crandall 1998), which only considers models that can be used by MrBayes. Indels were included in the analysis and treated as binary “morphological” characters with absence of an indel coded as “0,” and presence as “1.” The MCMC chains were run for 200,000 generations. Every 100th tree was saved, resulting in 2001 trees for each data set, of which the first 501 were discarded. This strategy supposedly conservatively discards the burn-in phase for likelihood scores, but there is no guarantee that this is so for the group frequencies, which is the parameter of interest here. Therefore, and because several studies have indicated high error rates on PPs for groups (e.g., Erixon et al., 2003), we chose a very high “significance level,” 0.99, for those.

Based on previous analyses (Oxelman and Lidén, 1995; Desfeux and Lejeune, 1996; Oxelman et al., 1997, 2001), Agrostemma githago was used for outgroup rooting. In addition to separate analyses of each region, MP and MPB analyses of the combined data were performed with the same settings. Only the rps16 and ITS data sets contained a single sequence per species, and to be able to concatenate the data sets the numbers of terminals were reduced in the other matrices. We used consensus sequences from sequences found to be monophyletic within species. In the cases where sequences were found to be para- or polyphyletic within species, all sequences from that species were excluded from the particular data set, and instead treated as missing data. This excluded, in particular, the polyploid taxa from RPB2 and RPD2, as well as S. acaulis, S. nivalis, and S. fruticosa from RPD2a.

Congruence Assessment
PLATO Ver.
2.0 (Grassly and Holmes, 1997) was used to test for incongruence in the combined data set. This software uses a sliding window to find regions in a nucleotide matrix that do not fit a given model, and was originally intended to discover possible recombination or selection in a maximum likelihood context. Regions with a significantly low likelihood indicate a deviation from the a priori phylogenetic model and therefore also possible topological incongruence. By assembling the sequence regions in an arbitrary order, we avoid restricting ourselves to the boundaries defined by the PCR primers. Thus, we also enable the detection of recombination events within the individual sequenced regions. One potential drawback of this approach might be that the order of the regions may affect the results. However, as long as detected deviating regions do not cross region borders, this appears to be unproblematic.

Due to the prohibitively long computation time required for a full maximum likelihood (ML in the following text) analysis estimating all parameters, ML analyses were performed as follows: an MP topology was obtained with a heuristic search carried out as described above. Initial parameter values were estimated using an arbitrary MP topology and a general time-reversible model with substitution rate variability among sites following a gamma distribution (GTR+Γ model), as suggested by MrModeltest (Nylander, 2001). A complete TBR branch swapping with fixed GTR+Γ parameter values was performed under the ML criterion with the MP topology as starting tree. The ML topology obtained was used to reestimate the parameter values, which in turn were used to perform TBR branch swapping of the ML topology. Finally, we performed a heuristic search with five random sequence additions, TBR branch swapping, and GTR+Γ using parameter values from the last iteration.
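The stepwise ML procedure just described alternates between estimating model parameters on a fixed topology and searching for a better topology with those parameters held fixed. The Python sketch below shows only that control flow. It is not PAUP* code: the three callables are hypothetical placeholders that a user would have to back with real likelihood software, and the function name is invented for this illustration.

```python
from typing import Callable, Tuple, TypeVar

Tree = TypeVar("Tree")
Params = TypeVar("Params")


def stepwise_ml_search(
    mp_tree: Tree,
    estimate_params: Callable[[Tree], Params],
    swap_with_fixed_params: Callable[[Tree, Params], Tree],
    random_addition_search: Callable[[Params, int], Tree],
    n_random_additions: int = 5,
) -> Tuple[Tree, Params]:
    """Alternate parameter estimation and topology search, then do a final search.

    All three callables are hypothetical stand-ins for calls to real
    phylogenetics software.
    """
    # 1. Initial parameter values estimated on an MP topology.
    params = estimate_params(mp_tree)
    # 2. Branch swapping under ML with parameters fixed, starting from the MP tree.
    ml_tree = swap_with_fixed_params(mp_tree, params)
    # 3. Parameters reestimated on the ML topology obtained.
    params = estimate_params(ml_tree)
    # 4. Branch swapping of the ML topology with the updated parameters.
    ml_tree = swap_with_fixed_params(ml_tree, params)
    # 5. Final heuristic search with random addition sequences, using the
    #    parameter values from the last iteration.
    final_tree = random_addition_search(params, n_random_additions)
    return final_tree, params
```

Keeping the parameters fixed during each topology search is what makes the procedure tractable: the expensive joint optimization is replaced by a short sequence of cheaper conditional steps.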
The ML topology thus obtained from the combined data sets and the other model parameter values estimated with PAUP* constituted the input for PLATO.

TABLE 3. Matrix and tree statistics.

                              rps16        ITS          RPA2         RPB2         RPD2         Combined
Terminals                     29           29           29           33           60           29
Included characters           933          533          883          728          1070         5033
Number/% PICᵃ                 119/13       157/29       220/25       229/31       366/34       962/19
Number/lengths of MP trees    691/378      6/625        528/620      3836/624     1296/1118    2/3240
CI/RI                         0.722/0.738  0.536/0.638  0.823/0.836  0.768/0.765  0.719/0.886  0.739/0.718

ᵃ Percentage parsimony informative characters.

Another limitation with PLATO (and many other model-based congruence tests) is that it is not possible to distinguish between the topology parameter and other parameters such as the shape of the gamma distribution, base frequencies, or the substitution rates. In other words, it is not possible to discern whether data have evolved under a different topology or if other parameters are causing the potential anomalies detected. We used a nonparametric bootstrap approach to isolate the topology parameter from the rest of the parameters in the model and test if all the data sets evolved under the same topology. Let $X_{comb}$ be the combined data set excluding a deviating data set, and $X_{dev}$ be the deviating data set. Further, let $T_{comb}$ be the ML topology for the combined data set excluding the deviating data set, and $T_{dev}$ be the ML topology for the deviating data set. $X^{i}_{dev}$ denotes the $i$th pseudoreplicate of $X_{dev}$ obtained by nonparametric bootstrapping. The free parameters (all but topology) in the model are denoted $\theta$ and the −log likelihood $-\ln L$. We generated 100 pseudoreplicates of $X_{dev}$, and $-\ln L(X^{i}_{dev} \mid T_{dev})$ was calculated, reestimating $\theta$ for each pseudoreplicate, thus generating a null distribution. Then $-\ln L(X_{dev} \mid T_{comb})$ was calculated, reestimating $\theta$, and subsequently ranked in the null distribution.
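The resampling and ranking steps of this test can be sketched in a few lines of Python. The sketch covers only the bootstrap mechanics: alignment columns are resampled with replacement, a score is computed for each pseudoreplicate, and the observed score is ranked in the upper tail of the resulting null distribution. The scoring function used here simply counts variable columns and is a toy stand-in, not a likelihood; in the actual test it would be the −ln L of the pseudoreplicate on the fixed topology with the free parameters reestimated. All function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Column = Sequence[str]  # one aligned site, one character per taxon


def bootstrap_columns(columns: Sequence[Column], rng: random.Random) -> List[Column]:
    """Resample alignment columns with replacement (one pseudoreplicate)."""
    n = len(columns)
    return [columns[rng.randrange(n)] for _ in range(n)]


def null_distribution(
    columns: Sequence[Column],
    score: Callable[[Sequence[Column]], float],
    n_reps: int = 100,
    seed: int = 0,
) -> List[float]:
    """Score n_reps pseudoreplicates to build the null distribution."""
    rng = random.Random(seed)
    return [score(bootstrap_columns(columns, rng)) for _ in range(n_reps)]


def upper_tail_fraction(observed: float, null: Sequence[float]) -> float:
    """One-sided rank: fraction of null values at least as large as observed."""
    return sum(1 for value in null if value >= observed) / len(null)


def count_variable_columns(columns: Sequence[Column]) -> float:
    """Toy stand-in for -ln L: counts columns with more than one state."""
    return float(sum(1 for column in columns if len(set(column)) > 1))


if __name__ == "__main__":
    alignment = ["AAAA", "ACAA", "GGGG", "TTTA", "CCCC"]  # toy columns
    null = null_distribution(alignment, count_variable_columns, n_reps=100)
    observed = count_variable_columns(alignment)
    print(upper_tail_fraction(observed, null))
```

Because a larger −ln L means a worse fit, only the upper tail is examined, which matches the one-sided nature of the comparison.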
The null hypothesis was rejected if $-\ln L(X_{dev} \mid T_{comb})$ was extreme at some level of significance. As noted by Goldman et al. (2000), the selection of an ML topology a posteriori (in our case, the deviating data ML topology) obscures the statistical interpretation of the obtained probability value. First, the test must obviously be one-sided, because the ML topology has higher likelihood than any other tree. Secondly, the probability must be corrected for all other possible tree topologies, as is done by some implementations of the Shimodaira-Hasegawa test (Goldman et al., 2000). This, however, severely reduces the power of the test. Parametric tests, such as those devised by Goldman et al. (2000), are much more sensitive, but also much more dependent on adequate models. Therefore, we refrain from making strict probabilistic conclusions from our tests, but rather use them to evaluate the relative topological incongruence from the data sets identified by PLATO as favoring significantly different models.

RESULTS

Table 3 summarizes the number of terminals, included characters, parsimony informative characters, percentage parsimony informative characters, number and lengths of MP trees, consistency index (CI), and retention index (RI) for the different DNA regions. The ML estimates of model parameter values for each data set and the combined data set are presented in Table 4. MPB percentages and posterior probabilities for groups in the tree from the combined data set (see Fig. 8) and comparable groups in the individual data sets are presented in Table 5.

rps16 and ITS
Both the rps16 and ITS data sets support the monophyly of Atocion, Lychnis, and Eudianthe (Figs. 3 and 4). Although Silene was recovered in a strict consensus of the most parsimonious trees from the ITS data (Fig. 4), neither the rps16 nor the ITS data sets have MPB percentages above 50 for a monophyletic Silene. There are two main groups within Silene that are consistently recovered in the rps16/ITS analyses (Oxelman and Lidén, 1995; Oxelman et al., 1997, 2001). In this study, one of these two groups was represented by Silene nocturna, S. bergiana, S. schafta, S. fruticosa, S. acaulis, and S. nivalis and will in the following text be referred to as Silene subgenus Silene, whereas the other group (the rest of the Silene species) will be referred to as Silene subgenus Behen (Moench) Bunge. No well-supported topological incongruences were detected between the previous analysis of the combined ITS-rps16 data sets (Oxelman et al., 2001) and the separate analyses performed here.

TABLE 4. Maximum likelihood estimates of separate and combined data sets under the general time reversible + gamma (GTR+Γ) model and maximum parsimony topologies.

Data partition        rps16       ITS         RPA2        RPB2        RPD2        Combined
−ln L                 3205.6218   3735.0188   4458.4968   4040.7808   7538.5282   23030.0220
Base frequencies
  A                   0.3543      0.1906      0.2585      0.2576      0.2580      0.2728
  C                   0.1321      0.2855      0.1742      0.1559      0.1799      0.1801
  G                   0.1631      0.2946      0.1867      0.2171      0.1744      0.1989
  T                   0.3505      0.2294      0.3807      0.3694      0.3876      0.3482
Relative nucleotide substitution rates
  AC                  1.1098      1.1337      1.1998      0.5638      0.7559      0.8804
  AG                  1.0683      2.6794      3.3094      1.9755      2.4075      2.0159
  AT                  0.3726      2.5900      1.0250      0.8088      0.7235      0.8003
  CG                  0.3289      0.3088      1.1390      0.9604      1.3284      0.7877
  CT                  1.3763      5.6042      2.4894      2.0557      2.0409      2.3557
Gamma shape (α)       0.4228      0.3581      1.8761      1.3218      3.0581      0.7585

TABLE 5. Summary of MPB percentages/posterior probabilities for nodes in the combined data tree (Fig. 8), and comparable nodes in the individual data sets. Negative numbers refer to conflicting nodes, numbers in italics indicate groups that are not found in all most parsimonious trees. Incongruencies that are considered “hard” are indicated in bold.

Node  Combined  rps16       ITS    RPA2   RPB2   RPD2a  RPD2b
1     99/1      75/.77      96     n/a    100    100    100
2     100/1     100/1       87     100    n/a    n/a    n/a
3     100/1     87/1        98     100    50     n/a    99
4     90/1      54/.92      <50    83     <50    80     <50
5     98/1      −55/−.58    <50    90     100    −62    n/a
6     100/1     100/1       100    100    100    100    n/a
7     85/.99    <50/<.50    <50    <50    <50    95     <50
8     100/1     99/1        100    100    <50    100    100
9     71/.63    −82/−.93    −56    86     <50    91     <50
10    100/1     88/1        <50    98     100    100    86
11    100/1     96/1        63     n/a    100    100    100
12    95/1      <50/<.50    <50    −67    <50    98     <50
13    100/1     −50/−.70    64     81     <50    <50    n/a
14    73/.63    −50/−.70    62     −81    −100   −94    n/a
15    100/1     97/1        93     96     <50    100    n/a
16    53/.98    −75/−.76    −86    100    −93    n/a    n/a
17    100/1     100/1       99     99     −100   100    n/a
18    100/1     90/1        <50    98     <50    −<50   <50
19    79/1      <50/<.50    <50    <50    <50    <50    <50
20    86/1      −<50/<.50   −<50   −66    <50    <50    <50
21    100/1     −53/<.50    73     n/a    <50    89     61
22    100/1     100/1       98     <50    99     100    100
23    96/1      91/1        <50    78     −86    68     64
24    96/1      <50/<.50    95     <50    <50    −56    73
25    96/.96    <50/<.50    69     <50    n/a    <50    n/a

RNA Polymerase Introns
Using either low stringency PCR conditions with degenerate primers and a nested PCR approach, or direct PCR with Sileneae-specific primers, we amplified at least one fragment from three of the four RNAP regions (RPA2, RPB2, and RPD2) in all taxa. RPC2 was excluded from further study after the initial sequences were produced (see results below).
Direct sequencing of PCR fragments produced with the Sileneae-specific primers resulted in clean, unambiguous sequences in most cases. The 5′ and 3′ ends of the sequenced regions had a varying, but relatively high, degree of similarity to the corresponding regions of Arabidopsis thaliana (Fig. 1), and intron positions from Sileneae sequences were inferred to be identical to intron positions in A. thaliana. Several taxa were polymorphic for one or more amplified regions and it was necessary to clone the PCR products to obtain readable sequences. Some of the taxa contained only monophyletic groups of sequences for a given region, whereas other taxa contained two or more nonmonophyletic sequences (Figs. 5 to 7).

RPA2.—Fourteen synonymous substitutions were found in the ca. 80 bp long sequenced region corresponding to the 3′ end of exon 23 in A. thaliana (Fig. 1, GenBank AC008030). A total of 56 substitutions, of which 15 were nonsynonymous within Sileneae, were found in the ca. 190 bp long region corresponding to exon 24 in A. thaliana. Furthermore, Agrostemma githago had one extra glutamic acid, whereas Eudianthe laeta appears to have lost a valine in exon 24. No stop codons were identified in either of the two sequenced exon regions in RPA2. Intron size for most taxa varied between 460 and 550 bp, with extremes found in A. githago (300 bp) and Lychnis coronaria (1185 bp), compared to 161 bp in A. thaliana. The RPA2 sequences from A. githago, S. linnaeana, and S. parishii were unreadable when sequenced directly. Cloning revealed two different, though monophyletic, sequences in all three taxa (Fig. 5). Most groups with MPB percentages above 50 in the RPA2 phylogeny were congruent with the previous analysis of the combined ITS and rps16 data sets (Oxelman et al., 2001).

RPB2.—The RPB2 intron in Sileneae varied between 462 bp (S. keiskei) and 519 bp (A. githago). The 5′ and 3′ regions corresponded to exons 23 and 24, respectively, in A. thaliana (Fig. 1).
Only synonymous substitutions were found in the ca. 70 (13 substitutions) and 50 (4 substitutions) bp long sequenced regions of exons 23 and 24, respectively, apart from one nonsynonymous substitution between Sileneae and A. thaliana in exon 24. The phylogenetic analyses of the RPB2 sequences resulted in a basally poorly resolved tree (Fig. 6). Agrostemma githago, E. coeli-rosa, and L. flos-cuculi were polymorphic, and cloning revealed two different, though monophyletic, sequences in each of the three taxa. Three other taxa, S. keiskei, S. rotundifolia, and S. parishii, were also cloned due to polymorphisms. These sequences did not form monophyletic groups within species. The monophyly of Atocion was strongly supported (MPB percentage 100), whereas neither Silene nor Lychnis was resolved as monophyletic in the MPB analysis. Lychnis chalcedonica was not found in either of the two Lychnis groups (both with MPB percentage 100), and Silene itself consisted of two well-supported (MPB percentages 99 and 100, respectively) clades and a few “stray species” in a polytomy with the rest of the ingroup.

FIGURE 3. One of 691 most parsimonious trees from the analysis of rps16. Branch lengths are proportional to the inferred number of substitutions per site under the GTR+Γ model. Numbers above branches indicate parsimony bootstrap percentages over 50, numbers below branches represent posterior probabilities.

FIGURE 4. One of six most parsimonious trees from the analysis of ITS. Branch lengths are proportional to the inferred number of substitutions per site under the GTR+Γ model. Numbers above branches indicate parsimony bootstrap percentages above 50, numbers below branches represent posterior probabilities. Nodes without numbers have bootstrap percentages <50. Dotted branches collapse in the strict consensus tree.

FIGURE 5. One of 528 most parsimonious trees from the analysis of RPA2. Branch lengths are proportional to the inferred number of substitutions per site under the GTR+Γ model. Numbers associated with taxon names refer to number of clones sequenced. The number is followed by an asterisk if the PCR product was obtained by nested PCR and degenerate primers. Numbers above branches indicate parsimony bootstrap percentages, numbers below branches represent posterior probabilities.

FIGURE 6. One of 3836 most parsimonious trees from the analysis of RPB2. Branch lengths are proportional to the inferred number of substitutions per site under the GTR+Γ model. Numbers associated with taxon names refer to number of clones sequenced. The number is followed by an asterisk if the PCR product was obtained by nested PCR and degenerate primers. Numbers above branches indicate parsimony bootstrap percentages, numbers below branches represent posterior probabilities.

RPC2.—Only a few RPC2 sequences (Table 1) were produced because a 35- to 45-bp AC/T repeat close to the 3′ end caused sequencing problems. The length of the fragment, 1300 to 1400 bp, would have made it necessary to run at least four separate sequencing reactions to sequence the entire fragment in both directions. The forward primer site C2F is located at the very 3′ end of exon 31 of the A. thaliana sequence (GenBank AB012240). The ca. 95-bp exon sequence, corresponding to exon 36 in A. thaliana, had only synonymous substitutions in the taxa investigated in this study. No RPC2 sequences were included in the analyses.

RPD2.—The partial exon sequences in Sileneae correspond to the 3′ end of exon 6 (Fig. 1, Table 5) in the two paralogues found in A. thaliana (GenBank AB020749 and AP000377).
The Sileneae-specific RPD2RP reverse primer is located one nucleotide position downstream of the 5′ end of exon 7. Two or more copies of RPD2 were found in most species in Sileneae. The phylogenetic analysis showed that there were two groups, RPD2a and RPD2b, of paralogous RPD2 sequences (Fig. 7). Only RPD2a sequences were found in E. coeli-rosa, Petrocoptis pyrenaica, and in the subgenus Silene clade, whereas only the RPD2b copy was found in Viscaria vulgaris. A single sequence was found in A. githago. The RPD2a and RPD2b sequences are readily alignable over most of the area in Sileneae, whereas they could not be reliably aligned with the Arabidopsis sequences. The length of the intron sequences varied between 227 bp (S. fruticosa) and 1123 bp (L. flos-cuculi), with most sequences being ca. 700 to 750 bp long.

In all taxa belonging to the subgenus Silene clade, two or more discrete sequences belonging to the RPD2a group were found after cloning. Most of these sequences did not form monophyletic groups within species (Fig. 7). Three sequences were found in S. fruticosa. Two of them were very short, 296 and 328 bp, respectively. The third sequence was 624 bp, a more “normal” length. The two short sequences had a large deletion from the end of the forward PCR primer (RPD2FP) to approximately half the intron. These sequences appeared in multiple independent PCR reactions, indicating that they might represent pseudogenes. Although S. nocturna was successfully sequenced directly without cloning, it contained 12 polymorphic sites. This sequence was sister to one of the S. bergiana sequences, with the second S. bergiana sequence as sister to this clade (MPB percentage 100). The S. nocturna polymorphisms did not correspond to the disagreements between the two S. bergiana sequences. The two sequences from S. schafta formed a monophyletic group in a trichotomy with the rest of the taxa in subgenus Silene.

The two RPD2b sequences found in S. parishii were not monophyletic (Fig. 7), but sisters to S. nigrescens and S. rotundifolia, respectively. The latter S. parishii sequence lacks the two last amino acids in exon 6 and has one amino acid substitution (a leucine for a proline). There was alignment ambiguity in the beginning of the intron, and a substantial proportion of the conserved splice region is missing. Thus, it seems likely that this sequence is a pseudogene.

All included genera were strongly supported as monophyletic groups with MPB percentages of 98 or 100 by the RPD2a sequences. The results from the previous analysis of ITS and rps16 (Oxelman et al., 2001) and the well-supported RPD2a topology (Fig. 7) were congruent at the generic level. The RPD2b clade was less well resolved than the RPD2a clade and there was no MPB support above 50 for a monophyletic Silene. Both Lychnis and Atocion had MPB percentages of 100, though, and the topology was largely congruent with the ITS and rps16 data.

Analysis of the combined data sets.—PLATO identified part of ITS (464 aligned bp, Z = 26.8), RPA2 (398 aligned bp, Z = 6.2), and RPB2 (459 aligned bp, Z = 15.3) as having significantly lower likelihoods with the ML topology for the combined data sets and the GTR+Γ model with parameter values estimated in PAUP*. These regions contain 97%, 66%, and 92% of the parsimony informative characters in the ITS, RPA2, and RPB2 data sets, respectively. The likelihood for the ITS data set evolving under the ML topology of the combined data set excluding ITS was ranked as number 72 among the 100 bootstrapped ITS data sets (i.e., P = 0.28). The corresponding P-values were 0.30 and 0.12 for the evolution of the RPA2 and RPB2 data sets under ML topologies inferred, excluding RPA2 and RPB2, respectively, from the combined data set. Thus, no strong topological incongruence was detected, and all data sets were included in the combined analysis.
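The arithmetic behind the rank-based P-values in the preceding paragraph can be sketched in a few lines of Python. This is an illustration only, not the authors' procedure as implemented: it assumes that the log-likelihood of a partition under the competing topology is ranked among the log-likelihoods of the bootstrap replicates of that partition (best first), and that P is the fraction of replicates ranked below it. The function names and the toy log-likelihoods are hypothetical, and the handling of ties and of the off-by-one convention is an assumption.

```python
def rank_among_replicates(test_lnl, replicate_lnls):
    """Rank of test_lnl among the replicate log-likelihoods (1 = best).

    Assumption: the rank is one more than the number of replicates
    with a strictly higher (better) log-likelihood.
    """
    better = sum(1 for lnl in replicate_lnls if lnl > test_lnl)
    return better + 1


def p_from_rank(rank, n_replicates):
    """P-value taken as the fraction of replicates ranked below `rank`.

    Assumption: this matches the paper's statement that rank 72 of 100
    corresponds to P = 0.28, i.e. (100 - 72) / 100.
    """
    if not 1 <= rank <= n_replicates:
        raise ValueError("rank must lie between 1 and n_replicates")
    return (n_replicates - rank) / n_replicates


if __name__ == "__main__":
    # Toy data (hypothetical): 100 distinct replicate log-likelihoods.
    replicates = [-1000.0 - i for i in range(100)]
    test_value = -1070.5  # 71 replicates score better than this value
    rank = rank_among_replicates(test_value, replicates)
    print(rank, p_from_rank(rank, len(replicates)))
```

A large P under this convention means that many bootstrap replicates fit worse than the competing topology does, which is why values such as 0.28 are read here as showing no strong topological conflict.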
The parsimony bootstrap analysis of the combined data sets resulted in a well-resolved topology where all genera have MPB percentages of 95 or higher (Fig. 8). Lychnis and Silene were resolved as sister to Atocion, Viscaria, Eudianthe, and Petrocoptis. A previously unresolved sister-group relationship (e.g., Oxelman et al., 2001) between Eudianthe and Petrocoptis was also found in the latter clade. The results are compatible with those of Oxelman et al. (2001).

FIGURE 7. One of 1296 most parsimonious trees from the analysis of RPD2. Branch lengths are proportional to the inferred number of substitutions per site under the GTR+Γ model. Numbers associated with taxon names refer to number of clones sequenced. The number is followed by an asterisk if the PCR product was obtained by nested PCR and degenerated primers. Numbers above branches indicate parsimony bootstrap percentages; numbers below branches represent posterior probabilities.

FIGURE 8. One of two most parsimonious trees from the analysis of the combined data set. Branch lengths are proportional to the inferred number of substitutions per site under the GTR+Γ model. Numbers above branches indicate parsimony bootstrap percentages; numbers below branches represent posterior probabilities. Numbers to the right of branching points represent the branches presented in Table 5.

DISCUSSION

We discuss the results from each RNAP intron separately. Results from more extensive phylogenetic analyses of rps16 and ITS sequences in Sileneae have been discussed thoroughly elsewhere (Oxelman and Lidén, 1995; Oxelman et al., 1997, 2001), and we will only discuss results that deviate from these analyses. Finally, we will discuss the combinability of the different sequence regions and the general utility of the RNAP strategy proposed in this paper.
RPA2

Although strong bootstrap support for monophyly of the genus Silene is found only in one of the RPD2 paralogues (Fig. 7), the sister-group relationship between the Silene clade and the rest of the ingroup shown in the RPA2 phylogeny (Fig. 5) is somewhat unexpected when compared to previous studies (Oxelman and Lidén, 1995; Oxelman et al., 1997, 2001) as well as the rest of the results in this study. However, the incongruence is poorly resolved, and might be due to stochastic effects together with lack of information. Nevertheless, we find it instructive to examine other explanations for this putatively incongruent gene phylogeny in some detail. Wendel and Doyle (1998) list a number of biological phenomena that can lead to incongruence, such as orthology/paralogy conflation, lineage sorting, rate heterogeneity among taxa, hybridization/introgression, and short internal branches.

One of the aims of this study is to minimize the risk of orthology/paralogy conflation by using degenerated primers and low-stringency PCR conditions in combination with cloning to amplify and sequence all possible paralogues. RPA2 polymorphisms were found in three taxa: S. parishii, S. linnaeana, and the outgroup Agrostemma githago. Because the sequences form monophyletic groups within species, the polymorphisms may be explained either by allelic variation or by autapomorphic gene duplications. No traces of an ancient gene duplication or polymorphic RPA2 gene pool were detected, and we conclude that there is no support for a hypothesis involving orthology/paralogy conflation.

Lineage sorting (e.g., Pamilo and Nei, 1988), i.e., the failure of alleles to coalesce within a species lineage, is very difficult to distinguish from orthology/paralogy conflation.
It is, however, unlikely that a polymorphic RPA2 allele pool has been maintained during a time span long enough for fixation of one allele and loss of the other in the Silene clade, while the opposite allele is fixed and lost, respectively, in all other lineages investigated here.

It is likely that a hybridization event would leave traces in more than one nuclear gene (e.g., Cronn et al., 1999). The topological pattern found in RPA2 is not found in any of the other four nuclear DNA regions, nor did we find any pattern of strong incongruence between the maternally inherited (Corriveau and Coleman, 1988) cpDNA and the nuclear DNA, as is sometimes seen in hybrids if evolutionary processes homogenize the paralogues (e.g., Brochmann et al., 1996). A hybridization event is therefore not supported as an explanation for the apparent incongruence.

In phylogenetic analyses using inconsistent models, rate heterogeneity among taxa may confound phylogeny estimations as a result of parallel substitution in faster evolving taxa, i.e., “long branch attraction” (Felsenstein, 1978), and therefore cause incongruence between different data partitions. A solution to this may be a denser taxon sampling to break up the long branches. The sampling of taxa from the Silene clade is rather scattered, and perhaps a denser sampling would result in a topology more in line with results from the analyses of the other data sets. A second solution is to use a phylogenetic method more robust to rate heterogeneity. Given a reasonable model and enough data, maximum likelihood is often suggested to be more robust than parsimony (Felsenstein, 1973). Analyzing the RPA2 data with the maximum likelihood method as implemented in PAUP*, using the HKY+Γ model suggested by MrModeltest 1.1b (Nylander, 2001), five random additions with TBR branch swapping, and estimating all parameters from the data resulted in a topology (not shown) that was basically the same as when analyzed with parsimony.
Thus, either the inferred RPA2 phylogeny is not confounded by long branch attraction, or the model is not robust to deviations in the data. The conclusion is that there is no positive evidence for long branch attraction as an explanation for the observed pattern in RPA2.

In the case of short internal branches, i.e., if lineage splits are common relative to the substitution rate, a relatively high degree of random variation is expected in the data, and a phylogenetic analysis would result in a poorly resolved tree. Separate analyses of several putatively independent DNA regions are predicted to have weakly supported “soft incongruences” (Wendel and Doyle, 1998) due to these phenomena, and the incongruences are expected to vanish when more data are added. Most of the MP topologies from the analyses of the separate data sets are poorly resolved and poorly supported by MPB to a varying degree. In the light of this, plus the fact that the MPB support of subgenus Silene as sister to the rest of the ingroup was low (two nodes with MPB percentages of <50 and 67, respectively), we draw the conclusion that the putative incongruence is a result of random variation as an effect of short internal branches. This hypothesis is not rejected by the likelihood score rank test discussed below.

RPB2

Several taxa in RPB2 had to be cloned due to polymorphisms making the sequences unreadable when the PCR products were sequenced directly. If sequences within a species were monophyletic (Fig. 6), the polymorphisms were assumed to be caused by divergent alleles in heterozygous individuals, or autapomorphic gene duplications. The sequences from S. keiskei, S. rotundifolia, and S. parishii did not form monophyletic groups within species, but nonmonophyly did not receive strong support. A weakly supported monophyletic group consisting of one sequence from S. parishii and one from S. keiskei is found together with the other sequences in a polytomy. Silene zawadskii and S.
nigrescens are resolved as sister group to these. The pattern is not too surprising as all three former taxa are tetraploids. The variation in the sequenced part of RPB2, however, is not enough to resolve the internal relationships of the paralogues. This group, also including S. linnaeana (MPB percentage of 99), is one of two well-supported groups found within Silene in the bootstrap analysis; the other group was within the subgenus Silene clade. Contrary to previous analyses (Oxelman and Lidén, 1995; Oxelman et al., 1997, 2001) and the RPA2 and RPD2 trees (Figs. 5 and 7), S. bergiana is not resolved as sister to S. nocturna. Silene bergiana is found to be sister to the rest of the clade in the strict consensus from the MP trees, but the MPB percentage for this relationship was less than 50. There is no MPB support above 50 from RPB2 data either for the otherwise well-supported Lychnis clade or for the Atocion/Viscaria clade.

RPD2

By comparing the short sequenced regions of exon 6 (Table 6) in the RPD2 paralogues found in Sileneae and Arabidopsis thaliana, it is clear that the paralogues in Sileneae are more closely related to each other than to either of the two paralogues in A. thaliana. Despite cloning of the PCR products and the use of paralogue-specific primers, only a single sequence was found in Agrostemma githago, the outgroup in this study. Several of the indels and substitutions diagnostic for either of the paralogues are found in the single sequence from A. githago. This indicates that the duplication occurred in the branch leading to the ingroup, or the sequence may be a result of incomplete concerted evolution (see below). However, the orthology of this sequence, as well as the duplication event, cannot be determined until other alignable outgroup sequences are added.

A single gene duplication is not enough to explain the RPD2 gene phylogeny (Fig. 7).
If one accepts only branches with MPB >95%, at least two more gene duplications and two losses have to be inferred in the subgenus Silene clade: firstly, one duplication and one loss in the lineage leading to the clade consisting of S. acaulis and S. fruticosa, and secondly, one duplication and one loss in the lineage leading to S. bergiana and S. nocturna (Fig. 7). As there are numerous species that are morphologically closer to either of these two species, it seems unlikely that S. nocturna would be derived from within S. bergiana. Silene schafta also contains two sequences, and although the two sequences were fairly divergent, this may be explained by heterozygosity. If one accepts all branches in the strict consensus tree (Fig. 7), several additional duplications and losses have to be inferred. All six species mentioned above are diploid, and therefore only one or two sequences are expected to be found from a single locus. The bootstrap support is low for most of the nodes, and part of the pattern could be explained by lineage sorting in recently diverged lineages. However, in S. nivalis and S. fruticosa, three sequences were found in each specimen, and lineage sorting alone cannot explain the pattern. No RPD2b paralogues were recovered in any of the sampled specimens from the subgenus Silene despite extensive cloning and the use of paralogue-specific primers, implying at least one more paralogue extinction.

Due to the perfect correspondence between multiple copies of RPD2a paralogues and the complete loss of the RPD2b paralogues in subgenus Silene, an alternative explanation for the observed pattern is incomplete concerted evolution. As a result of incomplete concerted evolution between RPD2a and RPD2b, one would expect mosaic sequences in case of reciprocal recombination, or more homogeneous sequences if gene conversion is operating (Wendel and Doyle, 1998).
Incomplete concerted evolution has been suggested to occur in small nuclear gene families such as Adh in Gossypium (Millar and Dennis, 1996), PgiC in Clarkia (Gottlieb and Ford, 1996), and glutamine synthetase in Pisum (Walker et al., 1995). No obvious mosaic sequences were found in the subgenus Silene RPD2 data set. Thus, there is no support for reciprocal recombination as the evolutionary process. Incomplete concerted evolution by gene conversion is a simpler explanation than a number of independent duplications and losses, and therefore a preferred hypothesis.

Combined Analysis and Comparisons of the Data Sets

Both the rps16 intron and ITS data sets show high among-site rate variation (α = 0.42 and 0.35, respectively; Table 4). The high among-site rate variation is likely correlated to constraints imposed by the secondary structure found in the rps16 group II intron (e.g., Kelchner, 2002) and ITS (e.g., Baldwin et al., 1995). The RNAP introns, on the other hand, have low among-site rate variation (α = 1.88, 1.32, and 3.06 for RPA2, RPB2, and RPD2, respectively; Table 4) and consequently seem to be free from such constraints to a much greater extent.

PLATO detected several regions where the ML topology had significantly low likelihood to explain the combined data set. These differences in likelihood score indicate deviations from the model including the topology parameter and/or other model parameter values (ML estimates from the combined data set) supplied as input to PLATO (Grassly and Holmes, 1997). Using PLATO, it is not possible to discern whether data have evolved under a different topology or if other parameters are causing the anomalies. Are the observed differences caused by different evolutionary histories, or are they just an effect of stochastic variation? To answer that question, it would be desirable to test whether or not the ML topologies inferred from the deviating regions (i.e., ITS, RPA2, and RPB2, respectively) discovered with PLATO are different from the ML topologies inferred from the rest of the combined data (excluding, in turn, ITS, RPA2, and RPB2) at some level of significance.

TABLE 6. Amino acid alignment of the 3′ end of exon 6 in RPD2.
Sileneae RPD2a    GKGIAC------GG-------T-L-RYATPFSTPSVESITEQLH
Sileneae RPD2b    GKGIAC------GG-------T-L-RYATPFSTPSVESITEQLH
A. thaliana 1     SKGIACPIQKKEGSSAAYTKLTRHATPFSTPGVTEITEQLH
A. thaliana 2     SKGIACPIQK-EGSSAAYTKLTRHATPFSTPGVTEITEQLH

Typically, different implementations of parametric bootstrapping (Huelsenbeck and Bull, 1996) are formulated to test whether the maximum likelihood topology or an alternative topology is true (Goldman et al., 2000). However, because we are interested in whether the ML topology differs significantly from an alternative topology, this is not the appropriate question to ask in our case. Parametric tests of topologies are highly sensitive to model misspecification (Buckley, 2002). The nonparametric tests of Kishino and Hasegawa (KH test) (Hasegawa and Kishino, 1989; Kishino and Hasegawa, 1989) and Shimodaira-Hasegawa (SH test) (Shimodaira and Hasegawa, 1999) have been used for pairwise (KH test) or multiple (SH test) tests of topologies. Goldman et al. (2000) showed that the KH test is not appropriate if one of the compared topologies is also the ML topology chosen a posteriori. If only two topologies are compared using the SH test, this test reduces to the KH test (Shimodaira and Hasegawa, 1999). Because one of the two topologies we want to compare is always the ML topology in this study, the SH test is also improper to use. Therefore, we do not use the KH test per se, but rather measure how the ML topology from one data set (combined data excluding a deviating data partition, e.g., ITS) fits into the likelihood distribution of another data set (e.g., ITS), which has a different ML topology.
Although it is difficult to define a relevant null hypothesis because the ML topology is selected a posteriori, the relative size of the obtained P values gives an indication of the relative impact of the topology parameter on each data partition.

PLATO detected the strongest deviations from the model in the ITS data. This contrasts with the low impact of the topology parameter (P = 0.28) on the ITS data. The only incongruence between the two topologies reasonably well supported by MPB is the internal relationship of the group consisting of S. acaulis, S. nivalis, and S. fruticosa, where a sister-group relationship between S. acaulis and S. nivalis is supported by ITS (MPB 86%; Fig. 4), but contradicted by MPB analysis of the combined data excluding ITS, where a sister-group relationship between S. nivalis and S. fruticosa is supported instead (MPB 69%; data not shown). The ML estimates of the parameter values show that ITS both deviates in relative nucleotide substitution rates and is slightly GC biased compared to the other partitions (and also the values in the model supplied to PLATO), which are AT biased (Table 4). The gamma parameter included in the model makes PLATO less sensitive to deviations in relative rates of nucleotide substitutions to some extent, whereas PLATO will detect significant differences in base composition (Grassly and Holmes, 1997). It therefore seems plausible that the model deviation stems from parameters other than topology in this case, and we conclude that the deviation observed with PLATO is due to the base composition in ITS.

The apparent incongruence detected in RPA2 and RPB2 cannot be explained by deviating rates of substitutions and/or base composition. The parameter values are in both data sets close to the values in the model used with PLATO (Table 4). Some topological incongruence was detected in both data sets. Both the position of subgenus Silene (two nodes with MPB <50% and 67%, respectively; Fig.
5) and the sister-group relationship of S. nivalis and S. fruticosa (MPB 100%) inferred from the RPA2 data were incongruent with the ML topology from combined data excluding RPA2. In RPB2, S. linnaeana was resolved at a slightly different position, but with poor MPB support (63% and 67%, respectively). The incongruences are not reflected in the likelihood ranking of the ML topologies (P = 0.30 and 0.12 excluding RPA2 and RPB2, respectively). It is notable, however, that many of the internal branches in the RPB2 topology are relatively short (Fig. 6). The branch lengths constitute a large fraction of the free parameters, and are dependent on the topology. One may argue that it is difficult to distinguish between topological differences and branch length differences, but we take the lack of strong differences in branching order and other parameters to indicate that it is the branch lengths themselves that are deviating. The biological explanation for this is obscure, but a reasonable hypothesis is that it is a random effect of rapid diversification of the group.

Based on the ML ranking, we cannot reject the null hypothesis that all our data have evolved under the same topology, and we therefore choose to combine all data in an MPB analysis. The analysis of the combined data set resolved the previously poorly known generic relationships within Sileneae (Fig. 8). A denser taxon sampling is needed to infer the relationships within subgenus Silene, but our analysis supports a hypothesis of a monophyletic genus Silene. To resolve the relationships of the polyploid taxa, it is necessary to search more thoroughly for paralogues (see below) and include them in the analysis, instead of excluding nonorthologous sequences as is done in this analysis. The taxonomic conclusions based on the rps16 and ITS data sets (Oxelman et al., 2001) are further substantiated by the results presented here.
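As a reading aid for the gamma shape parameters compared at the start of this section: under the usual parameterization of the gamma model of among-site rate variation, in which site rates have a mean of 1 (shape α, rate α), the variance of the rates is 1/α, so a small α corresponds to strong rate heterogeneity. The following Python sketch converts the α values quoted from Table 4 into variances and coefficients of variation. It assumes this mean-one parameterization; the function and variable names are illustrative and are not taken from the paper or from any phylogenetics package.

```python
import math


def rate_variance(alpha):
    """Variance of site rates for a mean-one gamma distribution (1/alpha)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / alpha


def rate_cv(alpha):
    """Coefficient of variation of site rates (1/sqrt(alpha))."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / math.sqrt(alpha)


# Shape parameters as quoted in the text (Table 4 of the paper).
ALPHAS = {"rps16": 0.42, "ITS": 0.35, "RPA2": 1.88, "RPB2": 1.32, "RPD2": 3.06}

if __name__ == "__main__":
    for region, alpha in ALPHAS.items():
        print(f"{region:6s} alpha={alpha:4.2f} "
              f"variance={rate_variance(alpha):5.2f} cv={rate_cv(alpha):4.2f}")
```

On this scale, the rps16 intron and ITS have rate variances above 2, whereas all three RNAP introns fall below 1, which is the contrast the text describes as lower among-site rate variation.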
General Utility of the Primer Design Strategy

The second PCR, with subunit-specific primers, yielded highly specific PCR products despite the low-stringency PCR conditions. With very few exceptions, all PCR products used for the sequences in this study were obtained at the first attempt, with either degenerated or Sileneae-specific primers.

Some sequences were “missing” in the RPD2 data set, i.e., only one of the expected paralogues was found in spite of several attempts with cloning PCR products obtained with subunit-specific primers. In addition, all attempts to amplify a “missing” paralogue with paralogue-specific primers failed despite using several different polymerases and varying PCR parameters such as annealing temperature, Mg2+, and primer concentration. There may be several explanations for this. Differences in secondary structure (Buckler et al., 1997) or differences in primer mismatching might bias the PCR, resulting in recovering only one of the copies, i.e., PCR selection (Wagner et al., 1994). Another possibility is physical elimination of one of the redundant copies, as is found in some allopolyploids (Shaked et al., 2001), or large inserts in pseudogenes (Tank and Sang, 2001) causing either inhibition or heavy bias of the PCR. Beyond running several reactions under varying conditions and with several different sets of primers and/or paralogue-specific primers (Rauscher et al., 2002) combined with cloning, PCR cannot take us any further, and an ultimate answer to whether there is another copy or not cannot be given by PCR alone. Despite these inherent difficulties, we argue that our method is suitable for studying evolutionary relationships of lcnDNA sequence regions.
The simultaneous analysis of multiple, presumably unlinked, lcnDNA sequence regions enables us to detect complicated evolutionary processes at the genome level, while offering a large amount of data for strong inferences of phylogenetic relationships at the “organismal” level. We suggest that this approach holds a very strong potential for phylogenetic studies of many organismal groups.

CONCLUSIONS

The addition of intron sequences from RPA2, RPB2, and RPD2 to the rps16 and ITS data sets results in a strongly supported phylogeny of the tribe Sileneae. Among-site rate variation is substantially lower in the RNA polymerase introns than in the rps16 intron and ITS. The analyses reveal evolutionary patterns consistent with gene duplication and incomplete concerted evolution in RPD2. Nested PCR with several sets of highly degenerated “universal” primers combined with cloning and subsequent design of more specific primers proves to be a powerful way to amplify and sequence low-copy nuclear DNA regions.

ACKNOWLEDGMENTS

We thank Chris Simon, Roberta Mason-Gamer, two anonymous reviewers, Katarina Andreasen, Magnus Lidén, Johan Nylander, and Sylvain Razafimandimbison for valuable comments; Reija Dufva, Inga Hallin, and Nahid Heidari for excellent help in the lab; Benjamin D. Hall for providing primer sequences; Mark W. Chase, the herbaria at WTU and UPS, and the Botanical Garden in Uppsala for providing plant material. This study was supported by Helge Ax:son Johnsons Stiftelse, The Swedish Research Council, The Royal Physiographic Society in Lund, The Royal Swedish Academy of Sciences, and Linnéstipendiet.

REFERENCES

Baldwin, B. G., M. J. Sanderson, M. J. Porter, M. F. Wojciechowski, C. S. Campbell, and M. J. Donoghue. 1995. The ITS region of nuclear ribosomal DNA: A valuable source of evidence on angiosperm phylogeny. Ann. Miss. Bot. Garden 82:247–277.
Brochmann, C., T. Nilsson, and T. M. Gabrielsen. 1996.
A classical example of postglacial allopolyploid speciation re-examined using RAPD markers and nucleotide sequences: Saxifraga osloensis. Symbolae Botanicae Upsaliensis 31:75–89.
Buckler, E. S., A. Ippolito, and T. P. Holtsford. 1997. The evolution of ribosomal DNA: Divergent paralogues and phylogenetic implications. Genetics 145:821–832.
Buckley, T. R. 2002. Model misspecification and probabilistic tests of topology: Evidence from empirical data sets. Syst. Biol. 51:509–523.
Clegg, M. T., M. P. Cummings, and M. L. Durbin. 1997. The evolution of plant nuclear genes. Proc. Nat. Acad. Sci. USA 94:7791–7798.
Corriveau, J. L., and A. W. Coleman. 1988. Rapid screening method to detect potential biparental inheritance of plastid DNA and results for over 200 angiosperm species. Am. J. Bot. 75:1443–1458.
Cronn, R. C., R. L. Small, T. Haselkorn, and J. F. Wendel. 2002. Rapid diversification of the cotton genus (Gossypium: Malvaceae) revealed by analysis of sixteen nuclear and chloroplast genes. Am. J. Bot. 89:707–725.
Cronn, R. C., R. L. Small, and J. F. Wendel. 1999. Duplicated genes evolve independently after polyploid formation in cotton. Proc. Nat. Acad. Sci. USA 96:14406–14411.
Denton, A. L., B. L. McConaughy, and B. D. Hall. 1998. Usefulness of RNA polymerase II coding sequences for estimation of green plant phylogeny. Mol. Biol. Evol. 15:1082–1085.
Desfeux, C., and B. Lejeune. 1996. Systematics of euromediterranean Silene (Caryophyllaceae): Evidence from a phylogenetic analysis using ITS sequences. Comptes Rendus Acad. Sci. Serie III Sci. Vie–Life Sci. 319:351–358.
Erixon, P., B. Svennblad, T. Britton, and B. Oxelman. 2003. The reliability of Bayesian posterior probabilities and bootstrap frequencies in phylogenetics. Syst. Biol. 52:665–673.
Felsenstein, J. 1973. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Syst. Zool. 22:240–249.
Felsenstein, J. 1978.
Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401–410.
Ferguson, C. J., and R. K. Jansen. 2002. A chloroplast DNA phylogeny of eastern Phlox (Polemoniaceae): Implications of congruence and incongruence with the ITS phylogeny. Am. J. Bot. 89:1324–1335.
Goldman, N., J. P. Anderson, and A. G. Rodrigo. 2000. Likelihood-based tests of topologies in phylogenetics. Syst. Biol. 49:652–670.
Gottlieb, L. D., and V. S. Ford. 1996. Phylogenetic relationships among the sections of Clarkia (Onagraceae) inferred from the nucleotide sequences of PgiC. Syst. Bot. 21:45–62.
Grassly, N. C., and E. C. Holmes. 1997. A likelihood method for the detection of selection and recombination using nucleotide sequences. Mol. Biol. Evol. 14:239–247.
Hasegawa, M., and H. Kishino. 1989. Confidence limits on the maximum-likelihood estimate of the hominoid tree from mitochondrial-DNA sequences. Evolution 43:672–677.
Huelsenbeck, J. P., and J. J. Bull. 1996. A likelihood ratio test to detect conflicting phylogenetic signal. Syst. Biol. 45:92–98.
Kelchner, S. A. 2002. Group II introns as phylogenetic tools: Structure, function, and evolutionary constraints. Am. J. Bot. 89:1651–1669.
Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum-likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170–179.
Kosuge, K., K. Sawada, T. Denda, J. Adachi, and K. Watanabe. 1995. Phylogenetic relationships of some genera in the Ranunculaceae based on alcohol dehydrogenase genes. Plant Syst. Evol. 9(Suppl):263–271.
Martin, A. P., and T. M. Burg. 2002. Perils of paralogy: Using HSP70 genes for inferring organismal phylogenies. Syst. Biol. 51:570–587.
Martin, W., and R. G. Herrmann. 1998. Gene transfer from organelles to the nucleus: How much, what happens, and why? Plant Physiol. 118:9–17.
Mason-Gamer, R. J., C. F. Weil, and E. A. Kellogg. 1998.
Granule-bound starch synthase: Structure, function, and phylogenetic utility. Mol. Biol. Evol. 15:1658–1673.
Millar, A. A., and E. S. Dennis. 1996. The alcohol dehydrogenase genes of cotton. Plant Mol. Biol. 31:897–904.
Nylander, J. A. A. 2003. MrModeltest, version 1.1b. Department of Systematic Zoology, EBC, Uppsala University, Sweden. E-mail: [email protected]
Oxelman, B., and B. Bremer. 2000. Discovery of paralogous nuclear gene sequences coding for the second-largest subunit of RNA polymerase II (RPB2) and their phylogenetic utility in Gentianales of the asterids. Mol. Biol. Evol. 17:1131–1145.
Oxelman, B., and M. Lidén. 1995. Generic boundaries in the tribe Sileneae (Caryophyllaceae) as inferred from nuclear rDNA sequences. Taxon 44:525–542.
Oxelman, B., M. Lidén, and D. Berglund. 1997. Chloroplast rps16 intron phylogeny of the tribe Sileneae (Caryophyllaceae). Plant Syst. Evol. 206:393–410.
Oxelman, B., M. Lidén, R. K. Rabeler, and M. Popp. 2001. A revised generic classification of the tribe Sileneae (Caryophyllaceae). Nordic J. Bot. 20:743–748.
Pamilo, P., and M. Nei. 1988. Relationships between gene trees and species trees. Mol. Biol. Evol. 5:568–583.
Popp, M., P. Erixon, F. Eggens, and B. Oxelman. In press. Origin and evolution of a circumpolar polyploid species complex in Silene (Caryophyllaceae) inferred from low copy nuclear RNA polymerase introns, rDNA, and chloroplast DNA. Syst. Bot.
Popp, M., and B. Oxelman. 2001. Inferring the history of the polyploid Silene aegaea (Caryophyllaceae) using plastid and homoeologous nuclear DNA sequences. Mol. Phylogenet. Evol. 20:474–481.
Posada, D., and K. A. Crandall. 1998. MODELTEST: Testing the model of DNA substitution. Bioinformatics 14:817–818.
Rauscher, J. T., J. J. Doyle, and A. H. D. Brown. 2002. Internal transcribed spacer repeat-specific primers and the analysis of hybridization in the Glycine tomentella (Leguminosae) polyploid complex. Mol. Ecol. 11:2691–2702.
Rujan, T., and W.
Martin. 2001. How many genes in Arabidopsis come from cyanobacteria? An estimate from 386 protein phylogenies. Trends in Genet. 17:113–120.
Sang, T. 2002. Utility of low-copy nuclear gene sequences in plant phylogenetics. Crit. Rev. Biochem. Mol. Biol. 37:121–147.
Shaked, H., K. Kashkush, H. Ozkan, M. Feldman, and A. A. Levy. 2001. Sequence elimination and cytosine methylation are rapid and reproducible responses of the genome to wide hybridization and allopolyploidy in wheat. Plant Cell 13:1749–1759.
Shimodaira, H., and M. Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. 16:1114–1116.
Swofford, D. L. 2002. PAUP*. Phylogenetic analysis using parsimony (*and other methods). Version 4.0b10. Sinauer Associates, Sunderland, Massachusetts.
Tank, D. C., and T. Sang. 2001. Phylogenetic utility of the glycerol-3-phosphate acyltransferase gene: Evolution and implications in Paeonia (Paeoniaceae). Mol. Phylogenet. Evol. 19:421–429.
The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815.
Wagner, A., N. Blackstone, P. Cartwright, M. Dick, B. Misof, P. Snow, G. P. Wagner, J. Bartels, M. Murtha, and J. Pendleton. 1994. Surveys of gene families using polymerase chain-reaction—PCR selection and PCR drift. Syst. Biol. 43:250–261.
Walker, E. L., N. F. Weeden, C. B. Taylor, P. Green, and G. M. Coruzzi. 1995. Molecular evolution of duplicate copies of genes encoding cytosolic glutamine synthetase in Pisum sativum. Plant Mol. Biol. 29:1111–1125.
Wen, J., M. Vanek-Krebitz, K. Hoffmann-Sommergruber, O. Scheiner, and H. Breiteneder. 1997. The potential of Betv1 homologues, a nuclear multigene family, as phylogenetic markers in flowering plants. Mol. Phylogenet. Evol. 8:317–333.
Wendel, J. F., and J. J. Doyle. 1998. Phylogenetic incongruence: Window into genome history and molecular evolution.
Pages 265–296 in Molecular systematics of plants. II (P. Soltis, D. Soltis, and J. Doyle, eds.). Kluwer Academic Press, Dordrecht.
Wendel, J. F., A. Schnabel, and T. Seelanan. 1995. Bidirectional interlocus concerted evolution following allopolyploid speciation in cotton (Gossypium). Proc. Nat. Acad. Sci. USA 92:280–284.
White, T. J., T. Bruns, S. Lee, and J. Taylor. 1990. Amplification and direct sequencing of fungal ribosomal RNA genes for phylogenetics. Pages 315–322 in PCR protocols: A guide to methods and applications (M. Innis, D. Gelfand, J. Sninsky, and T. J. White, eds.). Academic Press, San Diego, California.
Yokoyama, S., and D. E. Harry. 1993. Molecular phylogeny and evolutionary rates of alcohol dehydrogenases in vertebrates and plants. Mol. Biol. Evol. 10:1215–1226.
First submitted 29 July 2003; reviews returned 14 January 2004; final acceptance 23 August 2004
Associate Editor: Roberta Mason-Gamer