Molecular Phylogenetics and Evolution 71 (2014) 142–148
Short Communication
Sampling strategies for improving tree accuracy and phylogenetic
analyses: A case study in ciliate protists, with notes on the genus
Paramecium
Zhenzhen Yi a,c, Michaela Strüder-Kypke b, Xiaozhong Hu c, Xiaofeng Lin a,⇑, Weibo Song c,⇑
a Key Laboratory of Ecology and Environment Science in Guangdong Higher Education, School of Life Science, South China Normal University, Guangzhou 510631, China
b Department of Molecular and Cellular Biology, University of Guelph, Guelph, Ontario N1G 2W1, Canada
c Laboratory of Protozoology, Institute of Evolution & Marine Biodiversity, Ocean University of China, Qingdao 266003, China
Article info
Article history:
Received 3 May 2013
Revised 20 November 2013
Accepted 24 November 2013
Available online 6 December 2013
Keywords:
Missing data
Phylogeny
Multi-gene
Ciliophora
Paramecium
Abstract
In order to assess how dataset-selection for multi-gene analyses affects the accuracy of inferred phylogenetic trees in ciliates, we chose five genes and the genus Paramecium, one of the most widely used model
protist genera, and compared tree topologies of the single- and multi-gene analyses. Our empirical study
shows that: (1) Using multiple genes improves phylogenetic accuracy, even when their one-gene topologies are in conflict with each other. (2) The impact of missing data on phylogenetic accuracy is ambiguous: resolution power and topological similarity, but not number of represented taxa, are the most
important criteria of a dataset for inclusion in concatenated analyses. (3) As an example, we tested the
three classification models of the genus Paramecium with a multi-gene based approach, and only the
monophyly of the subgenus Paramecium is supported.
© 2013 Published by Elsevier Inc.
1. Introduction
It is generally accepted that conclusions drawn from molecular
phylogenies can be rather variable based on selection of gene
markers and taxon sampling (Parfrey et al., 2010; Rokas and
Carroll, 2005; Wiens et al., 2005; Yi and Song, 2011). The ideal
dataset for molecular phylogenetic studies comprises all taxa of
the in-group as well as gene markers which are capable of resolving relationships at different taxonomic levels. However, this is
nearly impossible when dealing with microbial taxa that are often
difficult to collect and culture (Yi et al., 2010). By retrieving gene
sequences from GenBank, or other databases, researchers can include a large number of taxa and genes provided by others.
Nevertheless, existing character-by-taxon databases are still fragmentary considering that the sets of genes sequenced for different
taxa may show only limited overlap across phylogenetic studies
(Sanderson and Driskell, 2003; Wiens, 2006). Therefore, it is important to evaluate whether genes with limited in-group taxa sampling should be included in combined phylogenetic analyses. This
topic has been discussed in several simulation investigations
(Gao and Norell, 1998; Wiens, 2003; Wilkinson and Benton,
1995) and empirical analyses (Huelsenbeck, 1991; Philippe et al.,
2004; Wiens et al., 2005; Roure et al., 2013), and opposing conclusions have been drawn. Briefly, some researchers argued that the inferred phylogeny is not sensitive to missing data (Philippe et al.,
2004; Roure et al., 2013), some showed the opposite (Gao and
Norell, 1998; Huelsenbeck, 1991; Wilkinson and Benton, 1995),
and others found that the effect varies depending on other factors
(Wiens, 2003; Wiens et al., 2005). To date, no empirical study
focusing on ciliates has been conducted, although guidelines are
urgently needed for this group.
⇑ Corresponding authors. Fax: +86 53282032283 (W. Song).
1055-7903/$ - see front matter © 2013 Published by Elsevier Inc.
http://dx.doi.org/10.1016/j.ympev.2013.11.013
The genus Paramecium is an exemplary group of ciliates and
well suited for an evaluation, because (a) phylogenetic relationships have been studied by many researchers; (b) the majority of
species (two thirds) are represented in GenBank; and (c) most of
the selected gene sequences are only available for different subsets
of these species. Paramecium has been known to science for more
than 250 years, and is one of the most used model protist genera
in numerous studies relating to different topics of, e.g., cytology,
genetics, and general biology (Fokin et al., 2001). To date, 17
Paramecium morphospecies (the Paramecium aurelia complex regarded as one species) are recognized as valid (Fokin et al.,
2004). Three models of subgeneric divisions have been suggested
by different investigators, since morphology and gene sequence
of Paramecium are diverse (Fokin et al., 2004; Jankowski, 1969,
1972; Woodruff, 1921). The results of molecular phylogenetic
studies of Paramecium are rather variable based on gene markers
and taxon sampling, and only limited numbers of species were
included in most previous investigations (Fokin et al., 2004; Hori
et al., 2006; Hoshina et al., 2006; Maciejewska, 2007a; Strüder-Kypke et al., 2000a,b).
In the present study, we perform multi-gene analyses using sequences available in GenBank (up until October 2012), focusing on
relationships within Paramecium, and discuss sampling strategies
and dataset selection for ciliates.
2. Methods
Three nuclear gene regions, i.e. small subunit ribosomal RNA
(SSU), internal transcribed spacer-5.8S-partial large subunit ribosomal RNA (ITS), cytosol-type 70-kDa heat-shock protein (HSP),
and two mitochondrial genes, i.e. Cytochrome c oxidase subunit I
(COI), Cytochrome b (CYT), are chosen in the present investigation,
since only these five markers have been sequenced for several species of the P.
aurelia complex, as well as for more than two other Paramecium
species. Tetrahymena pyriformis and Tetrahymena thermophila were
chosen as outgroup species for all datasets. All sequences are available from GenBank and listed in Table S1. In preliminary analyses, amino acid trees were generally less well supported than nucleotide
trees; therefore, only nucleotide trees for protein-coding genes
are included in our investigation.
The main purpose of the present investigation is to evaluate
whether genes with limited taxon sampling should be included
in combined phylogenetic analyses. In practice, genes with limited
taxon sampling are usually not considered, until all well sampled
genes are combined. Therefore, we only included poorly sampled
genes in our four- or five-gene combined datasets. In total, 12 datasets
listed in Table S2 were assembled for subsequent analyses.
The following trees were constructed: three well-sampled one-gene trees (SSU, HSP, COI), three well-sampled two-gene trees (2SSU_HSP, 2SSU_COI, 2HSP_COI), one well-sampled three-gene tree (3SSU_HSP_COI), two poorly sampled one-gene
trees (ITS, CYT), two four-gene trees (4SSU_HSP_COI_ITS, 4SSU_HSP_COI_CYT),
and one five-gene tree (5SSU_HSP_COI_ITS_CYT).
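The composition of the 12 datasets can be sketched programmatically. The following is only an illustrative enumeration that follows the naming scheme used in the text (gene-count prefix for combined datasets); the helper names `dataset_name` and `build_dataset_names` are hypothetical and were not part of the original workflow.

```python
from itertools import combinations

WELL_SAMPLED = ["SSU", "HSP", "COI"]   # genes with the most complete taxon sampling
POORLY_SAMPLED = ["ITS", "CYT"]        # genes sequenced for only a few species


def dataset_name(genes):
    """Name a dataset as in the text: combined sets get a gene-count prefix."""
    joined = "_".join(genes)
    return joined if len(genes) == 1 else f"{len(genes)}{joined}"


def build_dataset_names():
    names = [dataset_name([g]) for g in WELL_SAMPLED]                        # 3 one-gene
    names += [dataset_name(list(p)) for p in combinations(WELL_SAMPLED, 2)]  # 3 two-gene
    names.append(dataset_name(WELL_SAMPLED))                                 # 1 three-gene
    names += [dataset_name([g]) for g in POORLY_SAMPLED]                     # 2 sparse one-gene
    names += [dataset_name(WELL_SAMPLED + [g]) for g in POORLY_SAMPLED]      # 2 four-gene
    names.append(dataset_name(WELL_SAMPLED + POORLY_SAMPLED))                # 1 five-gene
    return names


DATASETS = build_dataset_names()
```

Poorly sampled genes enter only at the four- and five-gene level, which mirrors the design choice stated above.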
SSU and ITS gene sequences were aligned using ClustalW implemented in BIOEDIT 7.0.0 (Hall, 1999); ambiguously aligned positions were re-checked by eye in BIOEDIT. The SSU and ITS
datasets contain 1634 and 1056 positions, respectively. Protein-coding gene nucleotide sequences, i.e. HSP, COI, CYT, were first
aligned based on the predicted amino acid sequences using BIOEDIT 7.0.0, and then translated back into nucleotide sequences. Only
the first two codon positions were kept in the final alignment datasets which included 252 (HSP), 478 (COI), and 412 (CYT) positions,
respectively. Seven combined datasets were manually constructed
based on the above aligned nucleotide datasets, and gaps are introduced for missing data (genes). All alignment files are available
upon request.
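The two manual steps described above (keeping the first two codon positions, and padding missing genes with gaps during concatenation) can be illustrated with a minimal sketch. It assumes in-frame, codon-aligned sequences held in plain dictionaries; the function names and the toy sequences are hypothetical and do not reproduce the actual alignments, which were built in BIOEDIT.

```python
def first_two_codon_positions(seq):
    """Drop every third site from an in-frame, codon-aligned sequence."""
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


def concatenate(alignments, gene_order):
    """Join per-gene alignments; taxa lacking a gene receive a run of gaps.

    alignments: {gene: {taxon: aligned sequence}}; all sequences of one gene
    must have the same length.
    """
    taxa = sorted({t for gene in gene_order for t in alignments[gene]})
    combined = {t: "" for t in taxa}
    for gene in gene_order:
        block = alignments[gene]
        lengths = {len(s) for s in block.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences differ in length")
        width = lengths.pop()
        for t in taxa:
            # A missing gene is coded as gaps of the same width as the gene block.
            combined[t] += block.get(t, "-" * width)
    return combined


# Toy data (hypothetical sequences, for illustration only).
toy = {
    "SSU": {"P_caudatum": "ACGT", "P_bursaria": "ACGA"},
    "COI": {"P_caudatum": first_two_codon_positions("ATGGCC")},
}
matrix = concatenate(toy, ["SSU", "COI"])
```

In this toy case the taxon without a COI sequence ends up with four gap characters in the COI block, which is how missing genes are represented in the combined datasets.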
Maximum likelihood analyses, employing the GTR plus GAMMA
substitution model, were conducted using RAxML-HPC2 (Stamatakis et al., 2008) at the CIPRES website (http://www.phylo.org/).
Searches for the best tree were conducted starting from 1000 random trees, and 1000 bootstrap replicates were performed with the
multi-parametric algorithm implemented in RAxML. Phylogenetic
trees were viewed with MEGA 4 (Tamura et al., 2007).
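For readers unfamiliar with bootstrap support values, the general idea can be sketched as follows. This is the textbook nonparametric bootstrap (resampling alignment columns with replacement), shown only as an assumption-laden illustration; it is not the algorithm implemented in RAxML, and the function names and toy sequences are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate: columns resampled with replacement."""
    taxa = list(alignment)
    n_sites = len(alignment[taxa[0]])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {t: "".join(alignment[t][c] for c in columns) for t in taxa}


def clade_support(clade, replicate_clades):
    """Percentage of replicate trees (each given as a set of clades) containing `clade`."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return 100.0 * hits / len(replicate_clades)


# Toy alignment (hypothetical sequences, for illustration only).
toy = {"P_caudatum": "ACGTAC", "P_bursaria": "ACGAAC", "T_thermophila": "TCGATC"}
replicate = bootstrap_alignment(toy, random.Random(1))
```

A tree is inferred from each pseudo-replicate, and the support of a node is the percentage of replicate trees in which the corresponding clade is recovered.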
The best ML trees of seven 28-taxon datasets (two genes:
2SSU_HSP, 2SSU_COI, 2HSP_COI; three genes: 3SSU_HSP_COI; four
genes: 4SSU_HSP_COI_ITS, 4SSU_HSP_COI_CYT; five genes:
5SSU_HSP_COI_ITS_CYT) and 100 unique bootstrap trees of the given dataset were compared using the approximately unbiased
(AU) test (Shimodaira, 2002) as implemented in the CONSEL
package (Shimodaira and Hasegawa, 2001), to test whether the
topology produced by a given dataset was accepted or rejected
by the other datasets.
3. Results
3.1. Comparison of the multi-gene trees
Multi-gene trees (3SSU_HSP_COI, 4SSU_HSP_COI_ITS,
4SSU_HSP_COI_CYT, 5SSU_HSP_COI_CYT_ITS) produce similar
topologies, and all depict the phylogenetic assignments of three
species, i.e. Paramecium bursaria, Paramecium putrinum and
Paramecium duboscqui, as ambiguous. All Paramecium species are
consistently divided into two clades with variable support
(82–99%), i.e. Group I (P. aurelia complex + Paramecium jenningsi +
Paramecium schewiakoffi + Paramecium caudatum + Paramecium
multimicronucleatum) and Group II (Paramecium woodruffi
+ Paramecium nephridiatum + Paramecium polycaryum + Paramecium calkinsi) (Fig. 1). Among the four species of Group II, P. woodruffi
is sister taxon to P. nephridiatum (91–100% support). Within Group I,
P. jenningsi and P. schewiakoffi fall into the clade of species of the
P. aurelia complex, and we refer to this fully supported subgroup
as Group Ia; Group Ia appears to be sister to P. multimicronucleatum
and P. caudatum. Trees inferred from one- and two-gene datasets,
on the other hand, show variable topologies (Figs. 2 and 3).
Support values for these three groups (Group I, Ia, II) are usually
lower and more variable in one-gene and two-gene trees than in the multi-gene trees
(Figs. 1–3).
3.2. Comparison of the concatenated tree topologies based on genes
included
AU tests are used to test whether the topology produced by a
given dataset is accepted or rejected by other datasets. The results
(Table 1) show that multi-gene datasets produce more reliable tree
topologies than two-gene ones. The three tested two-gene tree
topologies are fully rejected (P < 0.01) in AU tests against all other
six 28-taxon datasets, while multi-gene trees are rejected
(P < 0.05) in fewer AU tests (Table 1). Among multi-gene trees,
the AU tests show that the 4SSU_HSP_COI_ITS topology (Fig. 1b),
rejected only by dataset 2SSU_HSP, conflicts least
with the other datasets (Table 1). On the other hand, the
4SSU_HSP_COI_ITS dataset itself rejects topologies inferred from
all other 28-taxon datasets except dataset 5SSU_HSP_COI_CYT_ITS.
Similarly, dataset 5SSU_HSP_COI_CYT_ITS rejects the topologies of all
other 28-taxon datasets except the topology from dataset 4SSU_HSP_COI_ITS.
Interestingly, the topology from dataset 5SSU_HSP_COI_CYT_ITS
is rejected by four out of six datasets (Table 1). The topologies of 3SSU_HSP_COI (Fig. 1a) and 4SSU_HSP_COI_CYT (Fig. 1d)
are both rejected by the datasets 4SSU_HSP_COI_ITS and
5SSU_HSP_COI_CYT_ITS (Table 1).
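The rejection counts quoted above can be recomputed from the P-values in Table 1. The sketch below transcribes those values with each row holding the tests performed by one dataset against the seven topologies; the entries reported as "<0.001" and "<0.01" are replaced by their upper bounds, which is an assumption made only so that they can be compared numerically with the 0.05 threshold.

```python
NAMES = [
    "2SSU_COI", "2SSU_HSP", "2HSP_COI", "3SSU_HSP_COI",
    "4SSU_HSP_COI_ITS", "4SSU_HSP_COI_CYT", "5SSU_HSP_COI_CYT_ITS",
]

LT_001 = 0.001  # upper bound standing in for the reported "<0.001"
LT_01 = 0.01    # upper bound standing in for the reported "<0.01"

# P[i][j]: P-value obtained when dataset NAMES[i] tests topology NAMES[j].
P = [
    [None,   LT_001, LT_001, 0.669, 0.286, 0.627, 0.046],
    [LT_001, None,   LT_001, 0.933, 0.018, 0.132, 0.034],
    [LT_001, LT_001, None,   0.832, 0.209, 0.220, 0.024],
    [LT_001, LT_001, LT_001, None,  0.151, 0.166, 0.013],
    [LT_001, LT_001, LT_001, 0.035, None,  LT_01, 0.097],
    [LT_001, LT_001, LT_001, 0.478, 0.108, None,  0.200],
    [LT_001, LT_001, LT_001, 0.008, 0.416, 0.028, None],
]

ALPHA = 0.05


def rejections_per_topology(p, alpha=ALPHA):
    """Count, for each topology (column), the datasets that reject it."""
    n = len(p)
    return [
        sum(1 for i in range(n) if p[i][j] is not None and p[i][j] < alpha)
        for j in range(n)
    ]


REJECTED = dict(zip(NAMES, rejections_per_topology(P)))
```

Under these assumptions each two-gene topology is rejected by all six other datasets, the 4SSU_HSP_COI_ITS topology by one dataset, and the five-gene topology by four, in agreement with the text.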
The comparison of the topology of all 12 best trees (Figs. 1–3)
shows that topologies of multi-gene trees are more stable. The
numbers of supported nodes (≥50%) are nearly the same in
two-gene trees (11–14) and multi-gene trees (11–13), and higher
than in one-gene trees (5–12). However, the number of nodes
supported (≥50%) by more than six trees is larger in multi-gene
trees (9 or 10) than in two-gene trees (7 or 8), and smallest in
one-gene trees (2–4). The situation for singleton nodes is inverse
(0 in multi-gene trees, 1–2 in two-gene trees, 1–6 in one-gene
trees).
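The node-sharing comparison can be sketched as a simple count over clades. In the illustration below each tree is reduced to a mapping from clades (sets of taxa) to bootstrap values; the three toy trees and their support values are hypothetical and serve only to show how shared and singleton nodes would be tallied at the 50% threshold.

```python
from collections import Counter


def supported_clades(tree_support, threshold=50):
    """Clades of one tree whose bootstrap support reaches the threshold."""
    return {clade for clade, bp in tree_support.items() if bp >= threshold}


def sharing_profile(trees, threshold=50):
    """For each tree, map every supported clade to the number of trees sharing it."""
    supported = {name: supported_clades(t, threshold) for name, t in trees.items()}
    counts = Counter(c for clades in supported.values() for c in clades)
    return {name: {c: counts[c] for c in clades} for name, clades in supported.items()}


# Toy input (hypothetical support values, for illustration only).
A = frozenset({"P_woodruffi", "P_nephridiatum"})
B = frozenset({"P_jenningsi", "P_schewiakoffi"})
C = frozenset({"P_bursaria", "P_putrinum"})
trees = {
    "t1": {A: 95, B: 80},
    "t2": {A: 91, B: 40, C: 55},
    "t3": {A: 100},
}
profile = sharing_profile(trees)
singletons_t2 = [c for c, n in profile["t2"].items() if n == 1]
```

A clade counted once is a singleton node in the sense used above; a clade counted in many trees is a stable node.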
Nodes with weak support in 3SSU_HSP_COI tend not to be
supported after adding sparsely sampled gene markers. The
poorly supported node N1 (75%) of this 3SSU_HSP_COI tree is
not supported in the 4SSU_HSP_COI_CYT tree (Fig. 1, Table 2).
Two other poorly supported nodes (N2 = 58%; N3 = 69%) are
not (N2) or even less (N3 = 51%) supported in 4SSU_HSP_COI_ITS
(Fig. 1, Table 2). In the tree topology of the 5SSU_HSP_COI_CYT_ITS
dataset, none of these three nodes is supported (Fig. 1,
Table 2). By contrast, addition of ITS and CYT to the concatenated dataset can improve topologies of some nodes, although
they show poor taxon representation. For example, compared
to the 3SSU_HSP_COI tree, the branching patterns of Group II in
the 4SSU_HSP_COI_ITS and 5SSU_HSP_COI_CYT_ITS trees are slightly
changed (P. calkinsi groups basally vs. P. polycaryum groups
basally) and supported with much higher values (91–98% vs.
67–100%). Additionally, node M1 is supported in the 4SSU_HSP_COI_ITS
and 5SSU_HSP_COI_CYT_ITS trees, and M2 is supported
in 4SSU_HSP_COI_ITS, but these two nodes are absent in
3SSU_HSP_COI (Fig. 1).
Fig. 1. Phylogenetic trees of Paramecium inferred from datasets 3SSU_HSP_COI (A), 4SSU_HSP_COI_ITS (B), 5SSU_HSP_COI_CYT_ITS (C), 4SSU_HSP_COI_CYT (D). Numbers
near nodes represent bootstrap support values (%), and support values lower than 50% are not shown. Subgroups or genera according to Woodruff (1921) (a: aurelia, b:
bursaria)/Jankowski (1969, 1972) (P: Paramecium, H: Helianter, Cy: Cypreostoma)/Fokin et al. (2004) (P: Paramecium, Cy: Cypriostomum, Ch: Chloroparamecium, H:
Helianter) are given after species names by symbols. Genes not sequenced for a given species are listed in the tree. N1–N3 represent nodes not supported in one of the multi-gene trees. BP of shared nodes in Figs. 1–3 are labeled in different colors. Numbers of shared nodes (1 to ≥6) in Figs. 1–3 are listed after colored boxes, and the exact numbers
in a given tree are listed in the colored boxes. M1 and M2 represent nodes not supported in 3SSU_HSP_COI but in other multi-gene trees. (For interpretation of the references
to color in this figure legend, the reader is referred to the web version of this article.)
3.3. Comparison of the one-gene tree topologies
Among the one-gene datasets, SSU, COI, and HSP have the most
complete taxon sampling. Among these three trees, the
highest number of supported clades (≥50%) within Group Ia is
found in the HSP tree (8 nodes), whereas the highest number among
the other species is found in the SSU and COI trees (7 nodes) (Fig. 2). The total number
of supported nodes varies from 8 (SSU) to 12 (HSP) among these
three trees. Five and nine supported nodes (≥50%) are present in
CYT and ITS, respectively (Fig. 2).
4. Discussion
4.1. Phylogenetic relationships and classification of Paramecium
species
The monophyly of Group I is firmly supported in all inferred
trees (Figs. 1–3). This confirms the “aurelia”-group sensu Woodruff
(1921) or the subgenus Paramecium sensu Jankowski (1969, 1972)
and Fokin et al. (2004). Group I also appears as a monophyletic taxon in previous molecular investigations (Boscaro et al., 2012; Fokin
et al., 2004; Hoshina et al., 2006; Przyboś et al., 2012; Strüder-Kypke and Lynn, 2010; Strüder-Kypke et al., 2000a,b). In addition,
species of this group are morphologically similar (Fokin et al.,
2004; Jankowski, 1969; Woodruff, 1921). However, other groups
defined by the systems of Woodruff (1921) and Jankowski (1969,
1972) do not appear to be monophyletic. The monophyly of the
“bursaria”-group sensu Woodruff (1921) is rejected by our (Figs. 1–3)
and previous molecular analyses (Boscaro et al., 2012; Fokin
et al., 2004; Przyboś et al., 2012; Strüder-Kypke and Lynn, 2010;
Strüder-Kypke et al., 2000a,b). The subgenera Cypreostoma and
Helianter sensu Jankowski (1969, 1972) are also demonstrated to
be non-monophyletic (Boscaro et al., 2012; Fokin et al., 2004;
Hoshina et al., 2006; Przyboś et al., 2012; Strüder-Kypke and Lynn,
2010; Strüder-Kypke et al., 2000a,b), although weak morphological
similarity can be found among congeners (Jankowski, 1969, 1972).
Fig. 2. Phylogeny of Paramecium inferred from nucleotide datasets ITS (A), CYT (B), COI (C), HSP (D), SSU (E). Numbers near nodes represent ML support values (%), and
support values lower than 50% are not shown. Subgroups or genera according to Woodruff (1921) (a: aurelia, b: bursaria)/Jankowski (1969, 1972) (P:
Paramecium, H: Helianter, Cy: Cypreostoma)/Fokin et al. (2004) (P: Paramecium, Cy: Cypriostomum, Ch: Chloroparamecium, H: Helianter) are given after species names by
symbols. BP of shared nodes in Figs. 1–3 are labeled in different colors. Numbers of shared nodes (1 to ≥6) in Figs. 1–3 are listed after colored boxes, and the exact numbers in
a given tree are listed in the colored boxes. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
The classification proposed by Fokin et al. (2004) appears to be
more appropriate. The subgenus Chloroparamecium is monotypic,
and the subgenus Cypriostomum is shown as monophyletic in seven (5SSU_HSP_COI_CYT_ITS, 4SSU_HSP_COI_ITS, 3SSU_HSP_COI,
4SSU_HSP_COI_CYT, 2HSP_COI, 2SSU_COI, COI) out of twelve trees.
The subgenus Helianter sensu Fokin et al. (2004), however, is
monophyletic only in one tree (COI) out of twelve (Figs. 1–3). Similar results were obtained in previous analyses (Boscaro et al.,
2012; Fokin et al., 2004; Hoshina et al., 2006; Przyboś et al.,
2012; Strüder-Kypke and Lynn, 2010; Strüder-Kypke et al.,
2000b). This is consistent with the suggestion of Fokin et al.
(2004) that both subgenera, Cypriostomum and Helianter, might
not be monophyletic, since the congeners are dissimilar to each
other in many aspects. For instance, two congeners of Helianter
(P. putrinum, P. duboscqui) have different nuclear reorganization
processes. Some Cypriostomum species are widely distributed, while
others have been detected only within a limited range.
P. jenningsi has been considered as the sister taxon of the P. aurelia-complex due to their similar morphological characters (Diller
and Earl, 1958; Fokin et al., 2001; Strüder-Kypke et al., 2000a,b).
This phylogenetic relationship was also supported by the COI tree
(Strüder-Kypke and Lynn, 2010). However, other studies found
that P. jenningsi groups within the P. aurelia complex in phylogenetic trees based on various gene markers, e.g., H4 (Maciejewska,
2007b), CYT (Barth et al., 2008), SSU (Boscaro et al., 2012; Fokin
et al., 2004), ITS (Boscaro et al., 2012; Przyboś et al., 2012) and
COI (Boscaro et al., 2012; Przyboś et al., 2012). In all our trees, P.
jenningsi also clusters within the P. aurelia-complex (Figs. 1–3).
Additionally, the recognition of P. jenningsi and the P. aurelia
complex as separate entities is only supported by three morphological characters (i.e. cell size, nuclei size, and shape of the macronuclear anlagen) (Diller and Earl, 1958). Based on our results, we
conclude that P. jenningsi is closely related to the P. aurelia complex,
and they may have separated only recently. Similarly, all our trees
(Figs. 1–3) and most previous molecular phylogenetic analyses
(Boscaro et al., 2012; Fokin et al., 2004; Hoshina et al., 2006; Przyboś et al., 2012; Strüder-Kypke and Lynn, 2010) demonstrate that
P. schewiakoffi is placed within the P. aurelia-complex clade,
and is usually closely related to P. jenningsi (Figs. 1, 2a–e, 3).
The CYT gene tree presented by Barth et al. (2008), however, suggests
that P. schewiakoffi clusters as sister-group to the P. aurelia
complex. P. schewiakoffi is morphologically distinct from P. jenningsi only by size and number of micronuclei. Therefore, the separation of P. schewiakoffi from the P. aurelia complex may be very
recent, too.
Fig. 3. Phylogeny of Paramecium inferred from combined datasets 2HSP_COI (A), 2SSU_COI (B), 2SSU_HSP (C). Numbers near nodes represent ML support values (%), and
support values lower than 50% are not shown. Subgroups or genera according to Woodruff (1921) (a: aurelia, b: bursaria)/Jankowski (1969, 1972) (P:
Paramecium, H: Helianter, Cy: Cypreostoma)/Fokin et al. (2004) (P: Paramecium, Cy: Cypriostomum, Ch: Chloroparamecium, H: Helianter) are given after species names by
symbols. BP of shared nodes in Figs. 1–3 are labeled in different colors. Numbers of shared nodes (1 to ≥6) in Figs. 1–3 are listed after colored boxes, and the exact numbers in
a given tree are listed in the colored boxes. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Table 1
Results of the AU test of congruence of datasets. Rows give the tested topology; columns (1)–(7) give the dataset used for the test, numbered in the same order as the rows.
Topology                   (1)       (2)       (3)       (4)       (5)       (6)       (7)
(1) 2SSU_COI               –         <0.001**  <0.001**  <0.001**  <0.001**  <0.001**  <0.001**
(2) 2SSU_HSP               <0.001**  –         <0.001**  <0.001**  <0.001**  <0.001**  <0.001**
(3) 2HSP_COI               <0.001**  <0.001**  –         <0.001**  <0.001**  <0.001**  <0.001**
(4) 3SSU_HSP_COI           0.669     0.933     0.832     –         0.035*    0.478     0.008*
(5) 4SSU_HSP_COI_ITS       0.286     0.018*    0.209     0.151     –         0.108     0.416
(6) 4SSU_HSP_COI_CYT       0.627     0.132     0.220     0.166     <0.01**   –         0.028*
(7) 5SSU_HSP_COI_CYT_ITS   0.046*    0.034*    0.024*    0.013*    0.097     0.200     –
* P < 0.05, significant.
** P < 0.01, highly significant.
4.2. Comparison of the tree topologies based on genes included
All 12 trees mostly support the monophyly of Groups I, Ia, and II
(60–100%), with the exception of Group II in the HSP, SSU and ITS trees
(Fig. 2). This demonstrates that the five genes produce nearly the
same topologies at subgenus/subgroup phylogenetic scales. By
contrast, the five one-gene trees differ in their topology at the species-level and within-complex-species-level placements (Fig. 2).
Even the two mitochondrial gene trees are dissimilar, as are the
three nuclear gene trees (Figs. 1–3). We found, however, that the
topological differences at the species level become negligible as long
as the number of concatenated genes is three or higher (Table 1).
Table 2
Support values (%) of three nodes in four multi-gene trees. Each of the three nodes is absent in at least one of the multi-gene trees.
                      N1                              N2                                                  N3
Absence or presence
SSU_HSP_COI           75                              58                                                  69
SSU_HSP_COI_CYT       –                               53                                                  71
SSU_HSP_COI_ITS       89                              –                                                   51
SSU_HSP_COI_CYT_ITS   –                               –                                                   –
Species at this node
SSU_HSP_COI           P. tetraurelia, P. octaurelia   P. novaurelia, P. tredecaurelia, P. quadecaurelia   P. duboscqui, P. bursaria, P. putrinum
CYT                   P. tetraurelia, P. octaurelia   P. novaurelia, P. tredecaurelia, P. quadecaurelia   –
ITS                   P. tetraurelia                  P. novaurelia                                       P. bursaria, P. putrinum
Congruence or incongruence
CYT                   s                               s                                                   h
ITS                   h                               h                                                   s
s: ITS/CYT tree is incongruent with SSU_HSP_COI tree in the specific node.
h: This node is missing in the ITS/CYT tree because not all species are represented in the respective dataset.
At present, most ciliate phylogenies are mainly inferred from
SSU gene sequences (Bapteste et al., 2002; Dunthorn et al., 2011;
Gentekaki and Lynn, 2010; Gong et al., 2007, 2010; Wolf et al.,
2004; Yi and Song, 2011). Researchers have not found a better gene
marker, not even a congruent one. This leads to the situation that
few species are sequenced for other gene markers. According to
our above results, other gene markers, even those producing conflicting topologies compared to SSU, should be sequenced and analyzed in order to infer phylogenetic relationships among ciliates
more reliably in the future.
4.3. Comparison of multi-gene topologies based on completeness of
datasets
As introduced in the Methods, 3SSU_HSP_COI serves as the basic
topology when discussing the effects of adding datasets with scarcely represented genes. In contrast to the results for datasets with similar taxon sampling presented above, the addition of genes that are
only represented in a small subset of the taxa (i.e. CYT, ITS) has
an ambiguous influence on the inference of relationships among Paramecium species. AU tests indicate that the addition of datasets
with scarcely represented genes can be beneficial for (4SSU_HSP_COI_ITS), have no influence on (4SSU_HSP_COI_CYT), or even
weaken (5SSU_HSP_COI_CYT_ITS) the resulting tree topology,
respectively. By comparing tree topologies, we find that topological
differences between CYT, ITS and 3SSU_HSP_COI, rather than sparse taxon sampling, play a more important role in the absence of
N1–N3 (Table 2). For example, if we compare the trees of
3SSU_HSP_COI and 4SSU_HSP_COI_ITS datasets, we detect that
N2 is not supported in the latter due to absence of key species in
the ITS tree, and that N3 is only weakly supported due to different
topologies in the ITS and 3SSU_HSP_COI trees. Similarly, the node
N1 is not supported in 4SSU_HSP_COI_CYT tree, because the analyses of datasets CYT and 3SSU_HSP_COI lead to different tree
topologies. Differences in the tree topologies, therefore, also seem
to be a key factor which reduces the accuracy of concatenated trees
containing missing data. From Table 1 and Fig. 1, it seems that the
major source of conflict in topologies inferred from datasets containing gene CYT (5SSU_HSP_COI_CYT_ITS, 4SSU_HSP_COI_CYT) is
the missing node N1 that is relatively highly supported in other
multi-gene datasets. Some of the previous studies (Gao and Norell,
1998; Huelsenbeck, 1991; Wilkinson and Benton, 1995) proposed
that missing data could reduce the reliability of tree topology.
However, our study does not indicate this. Average support values
of three nodes in Group II are much higher in the 5SSU_HSP_COI_CYT_ITS tree (95%) than in the 3SSU_HSP_COI tree (85%), although
at least one gene has not been sequenced for each species in the
5SSU_HSP_COI_CYT_ITS tree. Furthermore, our AU test results
rank the topology of the 4SSU_HSP_COI_ITS tree higher than the
topology of the 4SSU_HSP_COI_CYT tree, although the number of taxa
with an ITS sequence is much lower than the number of taxa with
a CYT sequence. The number of supported (≥50%) nodes is higher
in the tree inferred from ITS gene sequences than it is in the tree
inferred from CYT gene sequences (Fig. 2). All these results indicate that
ITS is a better marker than CYT for resolving intra-generic relationships
in Paramecium, and therefore the 4SSU_HSP_COI_ITS topology is rejected by fewer datasets than the 4SSU_HSP_COI_CYT topology.
These results match predictions from simulations of Wiens (2003),
suggesting that the missing data alone does not prevent the accurate placement of incomplete taxa missing this data. Instead, the
success of their placement depends primarily on how well they
can be placed by the datasets for which sequences are included.
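To give a sense of scale for the missing data discussed above, the proportion of gap positions for an incomplete taxon can be derived from the alignment lengths reported in the Methods. The sketch below is a simple worked calculation; the function name `missing_fraction` and the example of a taxon lacking both ITS and CYT are illustrative assumptions rather than values taken from Table S1.

```python
# Alignment lengths (positions) as reported in the Methods.
GENE_LENGTHS = {"SSU": 1634, "ITS": 1056, "HSP": 252, "COI": 478, "CYT": 412}


def missing_fraction(missing_genes, genes=GENE_LENGTHS):
    """Fraction of concatenated positions that are gaps for one taxon."""
    total = sum(genes.values())
    missing = sum(genes[g] for g in missing_genes)
    return missing / total


TOTAL = sum(GENE_LENGTHS.values())              # length of the five-gene matrix
no_its_cyt = missing_fraction(["ITS", "CYT"])   # taxon lacking both sparse genes
```

With these lengths the five-gene matrix has 3832 positions, so a taxon without ITS and CYT sequences would be coded as gaps at roughly 38% of its positions.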
4.4. Suggested sampling strategies for phylogenetic analyses
In practice, different one-gene datasets usually generate phylogenetic hypotheses of some critical branches that either lack support or are in conflict with each other, and the taxa of these
datasets are usually not fully overlapping. Using Paramecium as a
case study, we try to propose sampling strategies for future ciliate
phylogenetic analyses. First, datasets should be separated into
two groups, i.e. one-gene datasets with complete taxon sampling
(SSU, HSP, COI), and one-gene datasets with incomplete taxon sampling
(CYT, ITS). The follow-up strategies vary, depending on these
two groups. (1) For complete datasets, increasing the number of
analyzed genes is suggested in order to improve phylogenetic
accuracy, even when one-gene tree topologies are in conflict with
each other. We hope that further studies will give a suggested gene
number for a general ciliate topology. (2) For incomplete datasets,
we suggest that the resolution power and topological similarity of gene
markers, rather than the number of sampled taxa, are the more important criteria
for inclusion in the concatenated dataset.
Acknowledgments
This work was supported by the Natural Science Foundation of
China (Project Nos. 41006098, 31030059, 31222050, 41176119),
the Research Fund for the Doctoral Program of Higher Education
(Project No. 20104407120006), China Postdoctoral Science Foundation (Project No. 20110491623), and the Foundation for Distinguished
Young Talents in Higher Education of Guangdong, China (Project
Nos. LYM10060, 2012LYM_0049).
Appendix A. Supplementary material
Supplementary data associated with this article can be found, in
the online version, at http://dx.doi.org/10.1016/j.ympev.2013.11.013.
References
Bapteste, E., Brinkmann, H., Lee, J.A., Moore, D., Sensen, C., Gordon, P., Durufle, L.,
Gaasterland, T., Lopez, P., Müller, M., Philippe, H., 2002. The analysis of 100
genes supports the grouping of three highly divergent amoebae: Dictyostelium,
Entamoeba, and Mastigamoeba. Proc. Nat. Acad. Sci. 99, 1414–1419.
Barth, D., Przyboś, E., Fokin, S., Schlegel, M., Berendonk, T., 2008. Cytochrome b
sequence data suggest rapid speciation within the Paramecium aurelia species
complex. Mol. Phylogenet. Evol. 49, 669–673.
Boscaro, V., Fokin, S., Verni, F., Petroni, G., 2012. Survey of Paramecium duboscqui
using three markers and assessment of the molecular variability in the genus
Paramecium. Mol. Phylogenet. Evol. 65, 1004–1013.
Diller, W.F., Earl, P.R., 1958. Paramecium jenningsi, n. sp. J. Eukaryot. Microbiol. 5,
155–158.
Dunthorn, M., Foissner, W., Katz, L.A., 2011. Expanding character sampling for
ciliate phylogenetic inference using mitochondrial SSU-rDNA as a molecular
marker. Protist 162, 85–99.
Fokin, S.I., Przyboś, E., Chivilev, S.M., 2001. Nuclear reorganization variety in
Paramecium (Ciliophora: Peniculida) and its possible evolution. Acta Protozool.
40, 249–261.
Fokin, S.I., Przyboś, E., Chivilev, S.M., Beier, C.L., Horn, M., Skotarczak, B., Wodecka,
B., Fujishima, M., 2004. Morphological and molecular investigations of
Paramecium schewiakoffi sp. nov. (Ciliophora, Oligohymenophorea) and
current status of distribution and taxonomy of Paramecium spp. Eur. J.
Protistol. 40, 225–243.
Gao, K., Norell, M.A., 1998. Taxonomic revision of Carusia (Reptilia: Squamata) from
the Late Cretaceous of the Gobi Desert and phylogenetic relationships of
anguimorphan lizards. Am. Mus. Nov. 3230, 1–51.
Gentekaki, E., Lynn, D.H., 2010. Evidence for cryptic speciation in Carchesium
polypinum Linnaeus, 1758 (Ciliophora: Peritrichia) inferred from mitochondrial,
nuclear, and morphological markers. J. Eukaryot. Microbiol. 57, 508–519.
Gong, J., Kim, S., Kim, S., Min, G., Roberts, D., Warren, A., Choi, J., 2007. Taxonomic
redescriptions of two ciliates, Protogastrostyla pulchra n. g., n. comb. and
Hemigastrostyla enigmatica (Ciliophora: Spirotrichea: Stichotrichia), with
phylogenetic analyses based on 18S and 28S rRNA gene sequences. J.
Eukaryot. Microbiol. 54, 306–316.
Gong, Y., Xu, K., Zhan, Z., Yu, Y., Li, X., Villalobo, E., Feng, W., 2010. Alpha-tubulin and
small subunit rRNA phylogenies of Peritrichs are congruent and do not support
the clustering of Mobilids and Sessilids (Ciliophora, Oligohymenophorea). J.
Eukaryot. Microbiol. 57, 265–272.
Hall, T.A., 1999. BioEdit: a user-friendly biological sequence alignment editor and
analysis program for Windows 95/98/NT. Nucleic Acids Symp. Ser. 41, 95–98.
Hori, M., Tomikawa, I., Przyboś, E., Fujishima, M., 2006. Comparison of the
evolutionary distances among syngens and sibling species of Paramecium.
Mol. Phylogenet. Evol. 38, 697–704.
Hoshina, R., Hayashi, S., Imamura, N., 2006. Intraspecific genetic divergence of
Paramecium bursaria and re-construction of the paramecian phylogenetic tree.
Acta Protozool. 45, 337–386.
Huelsenbeck, J.P., 1991. When are fossils better than extant taxa in phylogenetic
analysis? Syst. Zool. 40, 458–469.
Jankowski, A.W., 1969. A proposed taxonomy of the genus Paramecium Hill, 1752
(Ciliophora). Zool. Zh. 48, 30–40 (in Russian with English summary).
Jankowski, A.W., 1972. Cytogenetics of Paramecium putrinum C. et L. (1858). Acta
Protozool. 10, 285–394.
Maciejewska, A., 2007a. Molecular phylogenetics of representative Paramecium
species. Folia Biol. (Praha) 55, 1–2.
Maciejewska, A., 2007b. Relationships of new sibling species of Paramecium
jenningsi based on sequences of the histone H4 gene fragment. Eur. J.
Protistol. 43, 125–130.
Parfrey, L.W., Grant, J., Tekle, Y.I., Lasek-Nesselquist, E., Morrison, H.G., Sogin, M.L.,
Patterson, D.J., Katz, L.A., 2010. Broadly sampled multigene analyses yield a
well-resolved eukaryotic tree of life. Syst. Biol. 59, 518–533.
Philippe, H., Snell, E.A., Bapteste, E., Lopez, P., Holland, P.W.H., Casane, D., 2004.
Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol.
Biol. Evol. 21, 1740–1752.
Przyboś, E., Tarcz, S., Potekhin, A., Rautian, M., Prajer, M., 2012. A two-locus
molecular characterization of Paramecium calkinsi. Protist 163, 263–273.
Rokas, A., Carroll, S.B., 2005. More genes or more taxa? the relative contribution of
gene number and taxon number to phylogenetic accuracy. Mol. Biol. Evol. 22,
1337–1344.
Roure, B., Baurain, D., Philippe, H., 2013. Impact of missing data on phylogenies
inferred from empirical phylogenomic data sets. Mol. Biol. Evol. 30, 197–214.
Sanderson, M.J., Driskell, A.C., 2003. The challenge of constructing large
phylogenetic trees. Trends Plant. Sci. 8, 374–379.
Shimodaira, H., 2002. An approximately unbiased test of phylogenetic tree
selection. Syst. Biol. 51, 492–508.
Shimodaira, H., Hasegawa, M., 2001. CONSEL: for assessing the confidence of
phylogenetic tree selection. Bioinformatics 17, 1246–1247.
Stamatakis, A., Hoover, P., Rougemont, J., 2008. A rapid bootstrap algorithm for the
RAxML web servers. Syst. Biol. 57, 758–771.
Strüder-Kypke, M.C., Lynn, D., 2010. Comparative analysis of the mitochondrial
cytochrome c oxidase subunit I (COI) gene in ciliates (Alveolata, Ciliophora) and
evaluation of its suitability as a biodiversity marker. Syst. Biodiv. 8, 131–148.
Strüder-Kypke, M.C., Wright, A.D., Fokin, S.I., Lynn, D.H., 2000a. Phylogenetic
relationships of the subclass Peniculia (Oligohymenophorea, Ciliophora)
inferred from small subunit rRNA gene sequences. J. Eukaryot. Microbiol. 47,
419–429.
Strüder-Kypke, M.C., Wright, A.G., Fokin, S.I., Lynn, D.H., 2000b. Phylogenetic
relationships of the genus Paramecium inferred from small subunit rRNA gene
sequences. Mol. Phylogenet. Evol. 14, 122–130.
Tamura, K., Dudley, J., Nei, M., Kumar, S., 2007. MEGA4: Molecular Evolutionary
Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol. 24, 1596–1599.
Wiens, J.J., 2003. Missing data, incomplete taxa, and phylogenetic accuracy. Syst.
Biol. 52, 528–538.
Wiens, J.J., 2006. Missing data and the design of phylogenetic analyses. J. Biomed.
Inform. 39, 34–42.
Wiens, J.J., Reeder, T.W., Fetzner, J.W., Parkinson, C.L., Duellman, W.E., 2005. Hylid
frog phylogeny and sampling strategies for species clades. Syst. Biol. 54, 719–
748.
Wilkinson, M., Benton, M.J., 1995. Missing data and rhynchosaur phylogeny. Hist.
Biol. 10, 137–150.
Wolf, Y.I., Rogozin, I.B., Koonin, E.V., 2004. Coelomata and not ecdysozoa: evidence
from genome-wide phylogenetic analysis. Genome Res. 14, 29–36.
Woodruff, L.L., 1921. The structure, life history, and intrageneric relationships of
Paramecium calkinsi, sp. nov. Biol. Bull. 41, 171–180.
Yi, Z., Song, W., 2011. Evolution of the order Urostylida (Protozoa, Ciliophora): new
hypotheses based on multi-gene information and identification of localized
incongruence. PLoS ONE 6, e17471, doi:10.1371/journal.pone.0017471.
Yi, Z., Dunthorn, M., Song, W., Stoeck, T., 2010. Increasing taxon sampling using both
unidentified environmental sequences and identified cultures improves
phylogenetic inference in the Prorodontida (Ciliophora, Prostomatea). Mol.
Phylogenet. Evol. 57, 937–941.