Molecular Phylogenetics and Evolution 71 (2014) 142–148

Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/ympev

Short Communication

Sampling strategies for improving tree accuracy and phylogenetic analyses: A case study in ciliate protists, with notes on the genus Paramecium

Zhenzhen Yi a,c, Michaela Strüder-Kypke b, Xiaozhong Hu c, Xiaofeng Lin a,⇑, Weibo Song c,⇑

a Key Laboratory of Ecology and Environment Science in Guangdong Higher Education, School of Life Science, South China Normal University, Guangzhou 510631, China
b Department of Molecular and Cellular Biology, University of Guelph, Guelph, Ontario N1G 2W1, Canada
c Laboratory of Protozoology, Institute of Evolution & Marine Biodiversity, Ocean University of China, Qingdao 266003, China

⇑ Corresponding authors. Fax: +86 53282032283 (W. Song).
E-mail addresses: [email protected] (X. Lin), [email protected] (W. Song).

1055-7903/$ - see front matter © 2013 Published by Elsevier Inc.
http://dx.doi.org/10.1016/j.ympev.2013.11.013

Article info

Article history: Received 3 May 2013; Revised 20 November 2013; Accepted 24 November 2013; Available online 6 December 2013

Keywords: Missing data; Phylogeny; Multi-gene; Ciliophora; Paramecium

Abstract

In order to assess how dataset selection for multi-gene analyses affects the accuracy of inferred phylogenetic trees in ciliates, we chose five genes and the genus Paramecium, one of the most widely used model protist genera, and compared tree topologies of the single- and multi-gene analyses. Our empirical study shows that: (1) Using multiple genes improves phylogenetic accuracy, even when their one-gene topologies are in conflict with each other. (2) The impact of missing data on phylogenetic accuracy is ambiguous: resolution power and topological similarity, but not number of represented taxa, are the most important criteria of a dataset for inclusion in concatenated analyses. (3) As an example, we tested the three classification models of the genus Paramecium with a multi-gene based approach, and only the monophyly of the subgenus Paramecium is supported.

© 2013 Published by Elsevier Inc.

1. Introduction

It is generally accepted that conclusions drawn from molecular phylogenies can vary considerably depending on the selection of gene markers and taxon sampling (Parfrey et al., 2010; Rokas and Carroll, 2005; Wiens et al., 2005; Yi and Song, 2011). The ideal dataset for molecular phylogenetic studies comprises all taxa of the in-group as well as gene markers that are capable of resolving relationships at different taxonomic levels. However, this is nearly impossible when dealing with microbial taxa that are often difficult to collect and culture (Yi et al., 2010). By retrieving gene sequences from GenBank or other databases, researchers can include a large number of taxa and genes provided by others. Nevertheless, existing character-by-taxon databases are still fragmentary, considering that the sets of genes sequenced for different taxa may show only limited overlap across phylogenetic studies (Sanderson and Driskell, 2003; Wiens, 2006). Therefore, it is important to evaluate whether genes with limited in-group taxon sampling should be included in combined phylogenetic analyses.

This topic has been discussed in several simulation investigations (Gao and Norell, 1998; Wiens, 2003; Wilkinson and Benton, 1995) and empirical analyses (Huelsenbeck, 1991; Philippe et al., 2004; Wiens et al., 2005; Roure et al., 2013), and opposing conclusions have been drawn. Briefly, some researchers argued that inferred phylogeny is not sensitive to missing data (Philippe et al., 2004; Roure et al., 2013), some showed the opposite (Gao and Norell, 1998; Huelsenbeck, 1991; Wilkinson and Benton, 1995), and others reported outcomes that vary depending on other factors (Wiens, 2003; Wiens et al., 2005).
To date, no empirical study focusing on ciliates has been conducted, although guidelines are urgently needed for this group. The genus Paramecium is an exemplary group of ciliates and well suited for an evaluation, because (a) phylogenetic relationships have been studied by many researchers; (b) the majority of species (two thirds) are represented in GenBank; and (c) most of the selected gene sequences are only available for different subsets of these species.

Paramecium has been known to science for more than 250 years, and is one of the most used model protist genera in numerous studies relating to different topics of, e.g., cytology, genetics, and general biology (Fokin et al., 2001). To date, 17 Paramecium morphospecies (the Paramecium aurelia complex regarded as one species) are recognized as valid (Fokin et al., 2004). Three models of subgeneric divisions have been suggested by different investigators, since the morphology and gene sequences of Paramecium are diverse (Fokin et al., 2004; Jankowski, 1969, 1972; Woodruff, 1921). The results of molecular phylogenetic studies of Paramecium vary considerably depending on gene markers and taxon sampling, and only limited numbers of species were included in most previous investigations (Fokin et al., 2004; Hori et al., 2006; Hoshina et al., 2006; Maciejewska, 2007a; Strüder-Kypke et al., 2000a,b).

In the present study, we perform multi-gene analyses using sequences available in GenBank (up until October 2012), focusing on relationships within Paramecium, and discuss sampling strategies and dataset selection for ciliates.

2. Methods

Three nuclear gene regions, i.e. small subunit ribosomal RNA (SSU), internal transcribed spacer-5.8S-partial large subunit ribosomal RNA (ITS), cytosol-type 70-kDa heat-shock protein (HSP), and two mitochondrial genes, i.e.
cytochrome c oxidase subunit I (COI) and cytochrome b (CYT), were chosen in the present investigation, since only these five markers have been sequenced for several members of the P. aurelia complex as well as for more than two other Paramecium species. Tetrahymena pyriformis and Tetrahymena thermophila were chosen as outgroup species for all datasets. All sequences are available from GenBank and listed in Table S1. In the preliminary analyses, amino acid trees were usually less well supported than nucleotide trees; therefore, only nucleotide trees for protein-coding genes are included in our investigation.

The main purpose of the present investigation is to evaluate whether genes with limited taxon sampling should be included in combined phylogenetic analyses. In practice, genes with limited taxon sampling are usually not considered until all well-sampled genes are combined. Therefore, we only included poorly sampled genes in our four- or five-gene combined datasets. In total, 12 datasets listed in Table S2 were assembled for subsequent analyses. Three well-sampled one-gene trees (SSU, HSP, COI), three well-sampled two-gene trees (2SSU_HSP, 2SSU_COI, 2HSP_COI), one well-sampled three-gene tree (3SSU_HSP_COI), two poorly sampled one-gene trees (ITS, CYT), two four-gene trees (4SSU_HSP_COI_ITS, 4SSU_HSP_COI_CYT), and one five-gene tree (5SSU_HSP_COI_CYT_ITS) were constructed.

SSU and ITS gene sequences were aligned using ClustalW implemented in BIOEDIT 7.0.0 (Hall, 1999); ambiguously aligned positions were re-checked by eye in BIOEDIT. The SSU and ITS datasets contain 1634 and 1056 positions, respectively. Protein-coding gene nucleotide sequences, i.e. HSP, COI, CYT, were first aligned based on the predicted amino acid sequences using BIOEDIT 7.0.0, and then translated back into nucleotide sequences. Only the first two codon positions were kept in the final alignment datasets, which included 252 (HSP), 478 (COI), and 412 (CYT) positions, respectively.
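As a hedged illustration of the codon-position filtering described above (the original work was carried out in BIOEDIT, not with this code), the following Python sketch drops every third position from a codon-aligned nucleotide sequence. The function name and the toy sequence are hypothetical. The sketch assumes that the alignment starts at codon position 1 and that gaps were introduced as whole codons, as they would be after back-translation from an amino acid alignment.

```python
def keep_first_two_codon_positions(aligned_nt: str) -> str:
    """Return the sequence with every third (codon position 3) site removed.

    Assumption: the input is in frame, i.e. index 0 is codon position 1
    and the alignment length is a multiple of three.
    """
    if len(aligned_nt) % 3 != 0:
        raise ValueError("codon alignment length must be a multiple of 3")
    # Indices 0 and 1 of each codon are kept; index 2 is discarded.
    return "".join(base for i, base in enumerate(aligned_nt) if i % 3 != 2)


# Toy example (hypothetical sequence): four codons, one of them a gap codon.
toy = "ATGGCC---TTA"
print(keep_first_two_codon_positions(toy))  # prints ATGC--TT
```

Filtering in this way shortens a codon alignment to two thirds of its length, which is consistent with the even position counts reported for HSP, COI, and CYT.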
Seven combined datasets were manually constructed based on the above aligned nucleotide datasets, and gaps were introduced for missing data (genes). All alignment files are available upon request.

Maximum likelihood analyses, employing the GTR plus GAMMA substitution model, were conducted using RAxML-HPC2 (Stamatakis et al., 2008) at the CIPRES website (http://www.phylo.org/). Searches for the best tree were conducted starting from 1000 random trees, and 1000 bootstrap replicates were performed with the multi-parametric algorithm implemented in RAxML. Phylogenetic trees were viewed with MEGA 4 (Tamura et al., 2007).

The best ML trees of the seven 28-taxon datasets (two genes: 2SSU_HSP, 2SSU_COI, 2HSP_COI; three genes: 3SSU_HSP_COI; four genes: 4SSU_HSP_COI_ITS, 4SSU_HSP_COI_CYT; five genes: 5SSU_HSP_COI_CYT_ITS) and 100 unique bootstrap trees from the given dataset were compared using the approximately unbiased (AU) test (Shimodaira, 2002) as implemented in the CONSEL package (Shimodaira and Hasegawa, 2001), to test whether the topology produced by a given dataset was accepted or rejected by the other datasets.

3. Results

3.1. Comparison of the multi-gene trees

Multi-gene trees (3SSU_HSP_COI, 4SSU_HSP_COI_ITS, 4SSU_HSP_COI_CYT, 5SSU_HSP_COI_CYT_ITS) produce similar topologies, and all depict the phylogenetic assignments of three species, i.e. Paramecium bursaria, Paramecium putrinum and Paramecium duboscqui, as ambiguous. All Paramecium species are consistently divided into two clades with variable support (82–99%), i.e. Group I (P. aurelia complex + Paramecium jenningsi + Paramecium schewiakoffi + Paramecium caudatum + Paramecium multimicronucleatum) and Group II (Paramecium woodruffi + Paramecium nephridiatum + Paramecium polycaryum + Paramecium calkinsi) (Fig. 1). Among the four species of Group II, P. woodruffi is sister taxon to P. nephridiatum (91–100% support). Within Group I, P. jenningsi and P. schewiakoffi fall into the clade of species of the P.
aurelia complex, and we refer to this fully supported subgroup as Group Ia; Group Ia appears to be sister to P. multimicronucleatum and P. caudatum. Trees inferred from one- and two-gene datasets, on the other hand, show variable topologies (Figs. 2 and 3). Support values for these three groups (Group I, Ia, II) are usually lower and more variable in one- and two-gene trees than in the multi-gene trees (Figs. 1–3).

3.2. Comparison of the concatenated tree topologies based on genes included

AU tests are used to test whether the topology produced by a given dataset is accepted or rejected by other datasets. The results (Table 1) show that multi-gene datasets (three or more genes) produce more reliable tree topologies than two-gene ones. The three tested two-gene tree topologies are fully rejected (P < 0.01) in AU tests against all other six 28-taxon datasets, while multi-gene trees are rejected (P < 0.05) in fewer AU tests (Table 1).

Among multi-gene trees, the AU tests show that the 4SSU_HSP_COI_ITS topology (Fig. 1b), rejected only by dataset 2SSU_HSP, is least conflicting with all other datasets (Table 1). On the other hand, the 4SSU_HSP_COI_ITS dataset itself rejects topologies inferred from all other 28-taxon datasets except dataset 5SSU_HSP_COI_CYT_ITS. Similarly, dataset 5SSU_HSP_COI_CYT_ITS rejects topologies of all other 28-taxon datasets except the topology from dataset 4SSU_HSP_COI_ITS. Interestingly, the topology from dataset 5SSU_HSP_COI_CYT_ITS is rejected by four out of six datasets (Table 1). The topologies of 3SSU_HSP_COI (Fig. 1a) and 4SSU_HSP_COI_CYT (Fig. 1d) are both rejected by the datasets 4SSU_HSP_COI_ITS and 5SSU_HSP_COI_CYT_ITS (Table 1).

The comparison of the topology of all 12 best trees (Figs. 1–3) shows that topologies of multi-gene trees are more stable. The numbers of supported nodes (≥50%) are nearly the same in two-gene trees (11–14) and multi-gene trees (11–13), and higher than in one-gene trees (5–12).
However, the number of nodes supported (≥50%) by more than six trees is larger in multi-gene trees (9 or 10) than in two-gene trees (7 or 8), and smallest in one-gene trees (2–4). The situation for singleton nodes is inverse (0 in multi-gene trees, 1–2 in two-gene trees, 1–6 in one-gene trees).

Nodes with weak support in 3SSU_HSP_COI tend not to be supported after adding sparsely sampled gene markers. The poorly supported node N1 (75%) of this 3SSU_HSP_COI tree is not supported in the 4SSU_HSP_COI_CYT tree (Fig. 1, Table 2). Two other poorly supported nodes (N2 = 58%; N3 = 69%) are not supported (N2) or even less supported (N3 = 51%) in 4SSU_HSP_COI_ITS (Fig. 1, Table 2). In the tree topology of the 5SSU_HSP_COI_CYT_ITS dataset, none of these three nodes is supported (Fig. 1, Table 2).

Fig. 1. Phylogenetic trees of Paramecium inferred from datasets 3SSU_HSP_COI (A), 4SSU_HSP_COI_ITS (B), 5SSU_HSP_COI_CYT_ITS (C), 4SSU_HSP_COI_CYT (D). Numbers near nodes represent bootstrap support values (%), and support values lower than 50% are not shown. Subgroups or genera according to Woodruff (1921) (a: aurelia, b: bursaria)/Jankowski (1969, 1972) (P: Paramecium, H: Helianter, Cy: Cypreostoma)/Fokin et al. (2004) (P: Paramecium, Cy: Cypriostomum, Ch: Chloroparamecium, H: Helianter) are given after species name by symbols. Genes not sequenced for a given species are listed in the tree. N1–N3 represent nodes not supported in one of the multi-gene trees. BP of shared nodes in Figs. 1–3 are labeled in different colors. Numbers of shared nodes (1 to ≥6) in Figs. 1–3 are listed after colored boxes, and the exact numbers in a given tree are listed in the colored boxes. M1 and M2 represent nodes not supported in 3SSU_HSP_COI but in other multi-gene trees. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
By contrast, addition of ITS and CYT to the concatenated dataset can improve topologies of some nodes, although these genes show poor taxon representation. For example, compared to the 3SSU_HSP_COI tree, the branching patterns of Group II in the 4SSU_HSP_COI_ITS and 5SSU_HSP_COI_CYT_ITS trees are slightly changed (P. calkinsi groups basally vs. P. polycaryum groups basally) and supported with much higher values (91–98% vs. 67–100%). Additionally, node M1 is supported in the 4SSU_HSP_COI_ITS and 5SSU_HSP_COI_CYT_ITS trees, and M2 is supported in 4SSU_HSP_COI_ITS, but these two nodes are absent in 3SSU_HSP_COI (Fig. 1).

3.3. Comparison of the one-gene tree topologies

Among the one-gene datasets, SSU, COI, and HSP have the most complete taxon sampling. Comparing these three trees, the highest number of supported clades (≥50%) within Group Ia is found in the HSP tree (8 nodes), whereas the highest number among the other species is found in the SSU and COI trees (7 nodes) (Fig. 2). The total number of supported nodes varies from 8 (SSU) to 12 (HSP) among these three trees. Five and nine supported nodes (≥50%) are present in CYT and ITS, respectively (Fig. 2).

4. Discussion

4.1. Phylogenetic relationships and classification of Paramecium species

The monophyly of Group I is firmly supported in all inferred trees (Figs. 1–3). This confirms the “aurelia”-group sensu Woodruff (1921) or the subgenus Paramecium sensu Jankowski (1969, 1972) and Fokin et al. (2004). Group I also appears as a monophyletic taxon in previous molecular investigations (Boscaro et al., 2012; Fokin et al., 2004; Hoshina et al., 2006; Przyboś et al., 2012; Strüder-Kypke and Lynn, 2010; Strüder-Kypke et al., 2000a,b). In addition, species of this group are morphologically similar (Fokin et al., 2004; Jankowski, 1969; Woodruff, 1921). However, other groups defined by the systems of Woodruff (1921) and Jankowski (1969, 1972) do not appear to be monophyletic.
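To make the notion of monophyly used in this section concrete, the following Python sketch checks whether a set of taxa forms a clade in a rooted tree. It is an illustrative sketch only and was not part of the original analyses. The nested-tuple tree encoding, the function names, and the toy taxa A–E are hypothetical, and the tree is assumed to be already rooted (for example by the outgroup).

```python
def leaves(node):
    """Return the set of leaf names below a node (a str leaf or a tuple of children)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(node):
    """Yield the leaf set of every internal node of a rooted tree."""
    if isinstance(node, str):
        return
    yield leaves(node)
    for child in node:
        yield from clades(child)


def is_monophyletic(tree, group):
    """A group is monophyletic if some internal node has exactly these leaves."""
    return frozenset(group) in set(clades(tree))


# Toy rooted tree with hypothetical taxa: ((A,B),C) is sister to (D,E).
toy_tree = ((("A", "B"), "C"), ("D", "E"))
print(is_monophyletic(toy_tree, {"A", "B", "C"}))  # prints True
print(is_monophyletic(toy_tree, {"B", "C"}))       # prints False
```

Applied to a proposed subgenus, such a check returns True only if all of its member species, and no others, descend from a single node of the inferred tree.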
The monophyly of the “bursaria”-group sensu Woodruff (1921) is rejected by our analyses (Figs. 1–3) and by previous molecular analyses (Boscaro et al., 2012; Fokin et al., 2004; Przyboś et al., 2012; Strüder-Kypke and Lynn, 2010; Strüder-Kypke et al., 2000a,b). The subgenera Cypreostoma and Helianter sensu Jankowski (1969, 1972) are also demonstrated to be non-monophyletic (Boscaro et al., 2012; Fokin et al., 2004; Hoshina et al., 2006; Przyboś et al., 2012; Strüder-Kypke and Lynn, 2010; Strüder-Kypke et al., 2000a,b), although weak morphological similarity can be found among congeners (Jankowski, 1969, 1972).

The classification proposed by Fokin et al. (2004) appears to be more appropriate. The subgenus Chloroparamecium is monotypic, and the subgenus Cypriostomum is shown as monophyletic in seven (5SSU_HSP_COI_CYT_ITS, 4SSU_HSP_COI_ITS, 3SSU_HSP_COI, 4SSU_HSP_COI_CYT, 2HSP_COI, 2SSU_COI, COI) out of twelve trees. The subgenus Helianter sensu Fokin et al. (2004), however, is monophyletic only in one tree (COI) out of twelve (Figs. 1–3).

Fig. 2. Phylogeny of Paramecium inferred from nucleotide datasets ITS (A), CYT (B), COI (C), HSP (D), SSU (E). Numbers near nodes represent ML support values (%), and support values lower than 50% are not shown. Subgroups or genera according to Woodruff (1921) (a: aurelia, b: bursaria)/Jankowski (1969, 1972) (P: Paramecium, H: Helianter, Cy: Cypreostoma)/Fokin et al. (2004) (P: Paramecium, Cy: Cypriostomum, Ch: Chloroparamecium, H: Helianter) are given after species name by symbols. BP of shared nodes in Figs. 1–3 are labeled in different colors. Numbers of shared nodes (1 to ≥6) in Figs. 1–3 are listed after colored boxes, and the exact numbers in a given tree are listed in the colored boxes. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Similar results were obtained in previous analyses (Boscaro et al., 2012; Fokin et al., 2004; Hoshina et al., 2006; Przyboś et al., 2012; Strüder-Kypke and Lynn, 2010; Strüder-Kypke et al., 2000b). This is consistent with the suggestion of Fokin et al. (2004) that both subgenera, Cypriostomum and Helianter, might not be monophyletic, since the congeners are dissimilar to each other in many aspects. For instance, two congeners of Helianter (P. putrinum, P. duboscqui) have different nuclear reorganization processes. Some Cypriostomum species are widely distributed, while others show only a limited distribution.

Fig. 3. Phylogeny of Paramecium inferred from combined datasets 2HSP_COI (A), 2SSU_COI (B), 2SSU_HSP (C). Numbers near nodes represent ML support values (%), and support values lower than 50% are not shown. Subgroups or genera according to Woodruff (1921) (a: aurelia, b: bursaria)/Jankowski (1969, 1972) (P: Paramecium, H: Helianter, Cy: Cypreostoma)/Fokin et al. (2004) (P: Paramecium, Cy: Cypriostomum, Ch: Chloroparamecium, H: Helianter) are given after species name by symbols. BP of shared nodes in Figs. 1–3 are labeled in different colors. Numbers of shared nodes (1 to ≥6) in Figs. 1–3 are listed after colored boxes, and the exact numbers in a given tree are listed in the colored boxes. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 1
Results of the AU test of congruence of datasets. Rows give the tested topology; columns give the dataset against which it was tested.

Topology              Dataset
                      2SSU_COI  2SSU_HSP  2HSP_COI  3SSU_HSP_COI  4SSU_HSP_COI_ITS  4SSU_HSP_COI_CYT  5SSU_HSP_COI_CYT_ITS
2SSU_COI              –         <0.001**  <0.001**  <0.001**      <0.001**          <0.001**          <0.001**
2SSU_HSP              <0.001**  –         <0.001**  <0.001**      <0.001**          <0.001**          <0.001**
2HSP_COI              <0.001**  <0.001**  –         <0.001**      <0.001**          <0.001**          <0.001**
3SSU_HSP_COI          0.669     0.933     0.832     –             0.035*            0.478             0.008*
4SSU_HSP_COI_ITS      0.286     0.018*    0.209     0.151         –                 0.108             0.416
4SSU_HSP_COI_CYT      0.627     0.132     0.220     0.166         <0.01**           –                 0.028*
5SSU_HSP_COI_CYT_ITS  0.046*    0.034*    0.024*    0.013*        0.097             0.200             –

* P < 0.05, significant.
** P < 0.01, highly significant.

P. jenningsi has been considered as the sister taxon of the P. aurelia complex due to their similar morphological characters (Diller and Earl, 1958; Fokin et al., 2001; Strüder-Kypke et al., 2000a,b). This phylogenetic relationship was also supported by the COI tree (Strüder-Kypke and Lynn, 2010). However, other studies found that P. jenningsi groups within the P. aurelia complex in phylogenetic trees based on various gene markers, e.g., H4 (Maciejewska, 2007b), CYT (Barth et al., 2008), SSU (Boscaro et al., 2012; Fokin et al., 2004), ITS (Boscaro et al., 2012; Przyboś et al., 2012) and COI (Boscaro et al., 2012; Przyboś et al., 2012). In all our trees, P. jenningsi also clusters within the P. aurelia complex (Figs. 1–3). Additionally, the recognition of P. jenningsi and the P. aurelia complex as separate entities is only supported by three morphological characters (i.e. cell size, nuclei size, and shape of the macronuclear anlagen) (Diller and Earl, 1958). Based on our results, we conclude that P. jenningsi is closely related to the P. aurelia complex, and that they may have separated only recently.

Similarly, all our trees (Figs. 1–3) and most previous molecular phylogenetic analyses (Boscaro et al., 2012; Fokin et al., 2004; Hoshina et al., 2006; Przyboś et al., 2012; Strüder-Kypke and Lynn, 2010) demonstrate that P. schewiakoffi is placed within the P. aurelia complex clade (Figs. 1–3), and is usually closely related to P. jenningsi (Figs. 1, 2a–e, 3). The CYT gene tree presented by Barth et al. (2008), however, suggests that P. schewiakoffi clusters as sister-group to the P. aurelia complex. P. schewiakoffi is morphologically distinct from P. jenningsi only by size and number of micronuclei. Therefore, the separation of P. schewiakoffi from the P.
aurelia complex may be very recent, too.

Table 2
Support values of three nodes in four multi-gene trees. Each of the three nodes is absent in at least one of the multi-gene trees.

Absence or presence (support values, %)
Node  SSU_HSP_COI  SSU_HSP_COI_CYT  SSU_HSP_COI_ITS  SSU_HSP_COI_CYT_ITS
N1    75           –                89               –
N2    58           53               –                –
N3    69           71               51               –

Species at this node
Node  SSU_HSP_COI                                       CYT                                               ITS
N1    P. tetraurelia, P. octaurelia                     P. tetraurelia, P. octaurelia                     P. tetraurelia
N2    P. novaurelia, P. tredecaurelia, P. quadecaurelia P. novaurelia, P. tredecaurelia, P. quadecaurelia P. novaurelia
N3    P. duboscqui, P. bursaria, P. putrinum            –                                                 P. bursaria, P. putrinum

Congruence or incongruence
Node  CYT  ITS
N1    s    h
N2    s    h
N3    h    s

s: ITS/CYT tree is incongruent with SSU_HSP_COI tree in the specific node.
h: This node is missing in the ITS/CYT tree because not all species are represented in the respective dataset.

4.2. Comparison of the tree topologies based on genes included

All 12 trees mostly support the monophyly of groups I, Ia, and II (60–100%), with the exception of Group II in the HSP, SSU and ITS trees (Fig. 2). This demonstrates that the five genes produce nearly the same topologies at subgenus/subgroup phylogenetic scales. By contrast, the five one-gene trees differ in their topology at the species-level and within-complex-species-level placements (Fig. 2). Even the two mitochondrial gene trees are dissimilar, as are the three nuclear gene trees (Figs. 1–3). We found, however, that the topological differences at the species level become negligible as long as the number of concatenated genes is three or higher (Table 1).

At present, most ciliate phylogenies are mainly inferred from SSU gene sequences (Bapteste et al., 2002; Dunthorn et al., 2011; Gentekaki and Lynn, 2010; Gong et al., 2007, 2010; Wolf et al., 2004; Yi and Song, 2011). Researchers have not found a better gene marker, not even a congruent one.
This leads to the situation that few species are sequenced for other gene markers. According to our results above, other gene markers, even those producing conflicting topologies compared to SSU, should be sequenced and analyzed in order to infer phylogenetic relationships among ciliates more reliably in the future.

4.3. Comparison of multi-gene topologies based on completeness of datasets

As introduced in the Methods, 3SSU_HSP_COI serves as the basic topology when discussing the effects of adding datasets with scarcely represented genes. In contrast to the results of datasets with similar taxon sampling presented above, the addition of genes that are only represented in a small subset of the taxa (i.e. CYT, ITS) has an ambiguous influence on the inference of relationships among Paramecium species. AU tests indicate that the addition of datasets with scarcely represented genes can be beneficial for (4SSU_HSP_COI_ITS), have no influence on (4SSU_HSP_COI_CYT), or even weaken (5SSU_HSP_COI_CYT_ITS) the resulting tree topology.

By comparing tree topologies, we find that topological differences between CYT, ITS and 3SSU_HSP_COI, rather than sparse taxon sampling, play the more important role in the absence of N1–N3 (Table 2). For example, if we compare the trees of the 3SSU_HSP_COI and 4SSU_HSP_COI_ITS datasets, we detect that N2 is not supported in the latter due to the absence of key species in the ITS tree, and that N3 is only weakly supported due to different topologies in the ITS and 3SSU_HSP_COI trees. Similarly, the node N1 is not supported in the 4SSU_HSP_COI_CYT tree, because the analyses of datasets CYT and 3SSU_HSP_COI lead to different tree topologies. Differences in the tree topologies, therefore, also seem to be a key factor that reduces the accuracy of concatenated trees containing missing data. From Table 1 and Fig.
1, it seems that the major source of conflict in topologies inferred from datasets containing gene CYT (5SSU_HSP_COI_CYT_ITS, 4SSU_HSP_COI_CYT) is the missing node N1, which is relatively highly supported in other multi-gene datasets.

Some of the previous studies (Gao and Norell, 1998; Huelsenbeck, 1991; Wilkinson and Benton, 1995) proposed that missing data could reduce the reliability of tree topology. However, our study does not indicate this. Average support values of three nodes in Group II are much higher in the 5SSU_HSP_COI_CYT_ITS tree (95%) than in the 3SSU_HSP_COI tree (85%), although for each species in the 5SSU_HSP_COI_CYT_ITS tree at least one gene has not been sequenced. Furthermore, our AU test results rank the topology of the 4SSU_HSP_COI_ITS tree higher than the topology of the 4SSU_HSP_COI_CYT tree, although the number of taxa with ITS sequences is much lower than the number of taxa with CYT sequences. The number of supported (≥50%) nodes is higher in the tree inferred from ITS gene sequences than in the tree inferred from CYT gene sequences (Fig. 2). All these results indicate that the ITS gene is a better marker than CYT for resolving intra-generic relationships in Paramecium, and therefore the 4SSU_HSP_COI_ITS topology is rejected by fewer datasets than the 4SSU_HSP_COI_CYT topology. These results match predictions from the simulations of Wiens (2003), suggesting that missing data alone do not prevent the accurate placement of incomplete taxa. Instead, the success of their placement depends primarily on how well they can be placed by the datasets for which sequences are included.

4.4. Suggested sampling strategies for phylogenetic analyses

In practice, different one-gene datasets usually generate phylogenetic hypotheses of some critical branches that either lack support or are in conflict with each other, and the taxa of these datasets are usually not fully overlapping.
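Because taxa rarely overlap fully across genes, concatenation requires padding the genes that are absent for a taxon, as was done in the Methods, where gaps were introduced for missing genes. The following Python sketch shows one way such a supermatrix could be assembled. It is a minimal illustration under stated assumptions, not the procedure used in this study (the combined datasets were constructed manually). The function name, the dictionary layout, and the toy sequences are hypothetical.

```python
def concatenate_with_gaps(gene_alignments, gene_order):
    """Concatenate per-gene alignments, filling missing genes with gaps.

    gene_alignments: {gene name: {taxon name: aligned sequence}} (assumed layout).
    gene_order: list of gene names giving the order of the partitions.
    """
    taxa = sorted({taxon for gene in gene_order for taxon in gene_alignments[gene]})
    parts = {taxon: [] for taxon in taxa}
    for gene in gene_order:
        alignment = gene_alignments[gene]
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError(f"sequences of gene {gene} differ in length")
        width = widths.pop()
        for taxon in taxa:
            # A taxon lacking this gene receives an all-gap block of equal width.
            parts[taxon].append(alignment.get(taxon, "-" * width))
    return {taxon: "".join(blocks) for taxon, blocks in parts.items()}


# Toy data (hypothetical): sp2 has no ITS sequence.
toy = {
    "SSU": {"sp1": "ACGT", "sp2": "ACGA"},
    "ITS": {"sp1": "TT"},
}
matrix = concatenate_with_gaps(toy, ["SSU", "ITS"])
print(matrix["sp1"])  # prints ACGTTT
print(matrix["sp2"])  # prints ACGA--
```

In such a matrix, the proportion of gap-only blocks per gene reflects how sparsely that gene is sampled across taxa.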
Using Paramecium as a case study, we propose sampling strategies for future ciliate phylogenetic analyses. First, datasets should be separated into two groups, i.e. one-gene datasets with complete taxon sampling (SSU, HSP, COI), and one-gene datasets with incomplete taxon sampling (CYT, ITS). The follow-up strategies will vary, depending on these two groups.

(1) For complete datasets, increasing the number of analyzed genes is suggested in order to improve phylogenetic accuracy, even when one-gene tree topologies are in conflict with each other. We hope that further studies will give a suggested gene number for a general ciliate topology.

(2) For incomplete datasets, we suggest that resolution power and topological similarity of gene markers, rather than taxon sampling size, are the more important criteria for inclusion in the concatenated dataset.

Acknowledgments

This work was supported by the Natural Science Foundation of China (Project Nos. 41006098, 31030059, 31222050, 41176119), the Research Fund for the Doctoral Program of Higher Education (Project No. 20104407120006), China Postdoctoral Science Foundation (Project No. 20110491623), and the Foundation for Distinguished Young Talents in Higher Education of Guangdong, China (Project Nos. LYM10060, 2012LYM_0049).

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.ympev.2013.11.013.

References

Bapteste, E., Brinkmann, H., Lee, J.A., Moore, D., Sensen, C., Gordon, P., Durufle, L., Gaasterland, T., Lopez, P., Müller, M., Philippe, H., 2002. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc. Nat. Acad. Sci. 99, 1414–1419. Barth, D., Przyboś, E., Fokin, S., Schlegel, M., Berendonk, T., 2008.
Cytochrome b sequence data suggest rapid speciation within the Paramecium aurelia species complex. Mol. Phylogenet. Evol. 49, 669–673.
Boscaro, V., Fokin, S., Verni, F., Petroni, G., 2012. Survey of Paramecium duboscqui using three markers and assessment of the molecular variability in the genus Paramecium. Mol. Phylogenet. Evol. 65, 1004–1013.
Diller, W.F., Earl, P.R., 1958. Paramecium jenningsi, n. sp. J. Eukaryot. Microbiol. 5, 155–158.
Dunthorn, M., Foissner, W., Katz, L.A., 2011. Expanding character sampling for ciliate phylogenetic inference using mitochondrial SSU-rDNA as a molecular marker. Protist 162, 85–99.
Fokin, S.I., Przyboś, E., Chivilev, S.M., 2001. Nuclear reorganization variety in Paramecium (Ciliophora: Peniculida) and its possible evolution. Acta Protozool. 40, 249–261.
Fokin, S.I., Przyboś, E., Chivilev, S.M., Beier, C.L., Horn, M., Skotarczak, B., Wodecka, B., Fujishima, M., 2004. Morphological and molecular investigations of Paramecium schewiakoffi sp. nov. (Ciliophora, Oligohymenophorea) and current status of distribution and taxonomy of Paramecium spp. Eur. J. Protistol. 40, 225–243.
Gao, K., Norell, M.A., 1998. Taxonomic revision of Carusia (Reptilia: Squamata) from the Late Cretaceous of the Gobi Desert and phylogenetic relationships of anguimorphan lizards. Am. Mus. Nov. 3230, 1–51.
Gentekaki, E., Lynn, D.H., 2010. Evidence for cryptic speciation in Carchesium polypinum Linnaeus, 1758 (Ciliophora: Peritrichia) inferred from mitochondrial, nuclear, and morphological markers. J. Eukaryot. Microbiol. 57, 508–519.
Gong, J., Kim, S., Kim, S., Min, G., Roberts, D., Warren, A., Choi, J., 2007. Taxonomic redescriptions of two ciliates, Protogastrostyla pulchra n. g., n. comb. and Hemigastrostyla enigmatica (Ciliophora: Spirotrichea: Stichotrichia), with phylogenetic analyses based on 18S and 28S rRNA gene sequences. J. Eukaryot. Microbiol. 54, 306–316.
Gong, Y., Xu, K., Zhan, Z., Yu, Y., Li, X., Villalobo, E., Feng, W., 2010. Alpha-tubulin and small subunit rRNA phylogenies of Peritrichs are congruent and do not support the clustering of Mobilids and Sessilids (Ciliophora, Oligohymenophorea). J. Eukaryot. Microbiol. 57, 265–272.
Hall, T.A., 1999. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp. Ser. 41, 95–98.
Hori, M., Tomikawa, I., Przyboś, E., Fujishima, M., 2006. Comparison of the evolutionary distances among syngens and sibling species of Paramecium. Mol. Phylogenet. Evol. 38, 697–704.
Hoshina, R., Hayashi, S., Imamura, N., 2006. Intraspecific genetic divergence of Paramecium bursaria and re-construction of the paramecian phylogenetic tree. Acta Protozool. 45, 337–386.
Huelsenbeck, J.P., 1991. When are fossils better than extant taxa in phylogenetic analysis? Syst. Zool. 40, 458–469.
Jankowski, A.W., 1969. A proposed taxonomy of the genus Paramecium Hill, 1752 (Ciliophora). Zool. Zh. 48, 30–40 (in Russian with English summary).
Jankowski, A.W., 1972. Cytogenetics of Paramecium putrinum C. et L. (1858). Acta Protozool. 10, 285–394.
Maciejewska, A., 2007a. Molecular phylogenetics of representative Paramecium species. Folia Biol. (Praha) 55, 1–2.
Maciejewska, A., 2007b. Relationships of new sibling species of Paramecium jenningsi based on sequences of the histone H4 gene fragment. Eur. J. Protistol. 43, 125–130.
Parfrey, L.W., Grant, J., Tekle, Y.I., Lasek-Nesselquist, E., Morrison, H.G., Sogin, M.L., Patterson, D.J., Katz, L.A., 2010. Broadly sampled multigene analyses yield a well-resolved eukaryotic tree of life. Syst. Biol. 59, 518–533.
Philippe, H., Snell, E.A., Bapteste, E., Lopez, P., Holland, P.W.H., Casane, D., 2004. Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol. Biol. Evol. 21, 1740–1752.
Przyboś, E., Tarcz, S., Potekhin, A., Rautian, M., Prajer, M., 2012. A two-locus molecular characterization of Paramecium calkinsi. Protist 163, 263–273.
Rokas, A., Carroll, S.B., 2005. More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy. Mol. Biol. Evol. 22, 1337–1344.
Roure, B., Baurain, D., Philippe, H., 2013. Impact of missing data on phylogenies inferred from empirical phylogenomic data sets. Mol. Biol. Evol. 30, 197–214.
Sanderson, M.J., Driskell, A.C., 2003. The challenge of constructing large phylogenetic trees. Trends Plant Sci. 8, 374–379.
Shimodaira, H., 2002. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51, 492–508.
Shimodaira, H., Hasegawa, M., 2001. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics 17, 1246–1247.
Stamatakis, A., Hoover, P., Rougemont, J., 2008. A rapid bootstrap algorithm for the RAxML web servers. Syst. Biol. 57, 758–771.
Strüder-Kypke, M.C., Lynn, D., 2010. Comparative analysis of the mitochondrial cytochrome c oxidase subunit I (COI) gene in ciliates (Alveolata, Ciliophora) and evaluation of its suitability as a biodiversity marker. Syst. Biodiv. 8, 131–148.
Strüder-Kypke, M.C., Wright, A.D., Fokin, S.I., Lynn, D.H., 2000a. Phylogenetic relationships of the subclass Peniculia (Oligohymenophorea, Ciliophora) inferred from small subunit rRNA gene sequences. J. Eukaryot. Microbiol. 47, 419–429.
Strüder-Kypke, M.C., Wright, A.G., Fokin, S.I., Lynn, D.H., 2000b. Phylogenetic relationships of the genus Paramecium inferred from small subunit rRNA gene sequences. Mol. Phylogenet. Evol. 14, 122–130.
Tamura, K., Dudley, J., Nei, M., Kumar, S., 2007. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol. 24, 1596–1599.
Wiens, J.J., 2003. Missing data, incomplete taxa, and phylogenetic accuracy. Syst. Biol. 52, 528–538.
Wiens, J.J., 2006. Missing data and the design of phylogenetic analyses. J. Biomed. Inform. 39, 34–42.
Wiens, J.J., Reeder, T.W., Fetzner, J.W., Parkinson, C.L., Duellman, W.E., 2005. Hylid frog phylogeny and sampling strategies for species clades. Syst. Biol. 54, 719–748.
Wilkinson, M., Benton, M.J., 1995. Missing data and rhynchosaur phylogeny. Hist. Biol. 10, 137–150.
Wolf, Y.I., Rogozin, I.B., Koonin, E.V., 2004. Coelomata and not Ecdysozoa: evidence from genome-wide phylogenetic analysis. Genome Res. 14, 29–36.
Woodruff, L.L., 1921. The structure, life history, and intrageneric relationships of Paramecium calkinsi, sp. nov. Biol. Bull. 41, 171–180.
Yi, Z., Song, W., 2011. Evolution of the order Urostylida (Protozoa, Ciliophora): new hypotheses based on multi-gene information and identification of localized incongruence. PLoS ONE 6, e17471, doi:10.1371/journal.pone.0017471.
Yi, Z., Dunthorn, M., Song, W., Stoeck, T., 2010. Increasing taxon sampling using both unidentified environmental sequences and identified cultures improves phylogenetic inference in the Prorodontida (Ciliophora, Prostomatea). Mol. Phylogenet. Evol. 57, 937–941.