Using Supermatrices for Phylogenetic Inquiry: An

Syst. Biol. 62(2):205–219, 2013
© The Author(s) 2012. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved.
For Permissions, please email: [email protected]
DOI:10.1093/sysbio/sys088
Advance Access publication October 26, 2012
Using Supermatrices for Phylogenetic Inquiry: An Example Using the Sedges
CODY E. HINCHLIFF∗
AND
ERIC H. R OALSON
School of Biological Sciences, Washington State University, Pullman, WA 99164-4236, USA
∗ Correspondence to be sent to: Department of Ecology and Evolutionary Biology, University of Michigan, 2071A Kraus Natural Science Building, 830 N
University, Ann Arbor, MI 48109-1048, USA; E-mail: [email protected].
Received 29 February 2012; reviews returned 14 June 2012; accepted 15 October 2012
Associate Editor: Michael A. Charleston
Abstract.—In this article, we use supermatrix data-mining methods to reconstruct a large, highly inclusive phylogeny of
Cyperaceae from nucleotide data available on GenBank. We explore the properties of these trees and their utility for
phylogenetic inference, and show that even the highly incomplete alignments characteristic of supermatrix approaches
may yield very good estimates of phylogeny. We present a novel pipeline for filtering sparse alignments to improve
their phylogenetic utility by maximizing the partial decisiveness of the matrices themselves through a technique we call
“phylogenetic scaffolding,” and we present a new method of scoring tip instability (i.e. “rogue taxa”) based on the I statistic
implemented in the software Mesquite. The modified statistic, which we call I S , is somewhat more straightforward to
interpret than similar statistics, and our implementation of it may be applied to large sets of large trees. The largest sedge
trees presented here contain more than 1500 tips (about one quarter of all sedge species) and are based on multigene
alignments with more than 20 000 sites and more than 90% missing data. These trees match well with previously supported
phylogenetic hypotheses, but have lower overall support values and less resolution than more heavily filtered trees. Our
best-resolved trees are characterized by stronger support values than any previously published sedge phylogenies, and
show some relationships that are incongruous with previous studies. Overall, we show that supermatrix methods offer
powerful means of pursuing phylogenetic study and these tools have high potential value for many systematic biologists.
[Cyperaceae; decisiveness; megaphylogeny; PHLAWD; supermatrix.]
The sedge family, Cyperaceae, is one of the world’s
10 most speciose families of flowering plants (Stevens
2008), containing more than 5400 species (Govaerts
et al. 2007). Sedge species richness is highest in
the tropics, but sedges may be found in nearly all
angiosperm-supporting habitats across the biosphere,
spanning arctic tundra, alpine zones, temperate and
tropical forests, savannas, prairies, marshes, swamps,
and deserts (Dahlgren et al. 1985; Bruhl 1995). Sedges
are phenotypically and ecologically diverse, including
tiny ephemerals, fire-resistant tussocks, scandent vines
more than 10 m long, and submerged aquatic herbs.
The combination of their broad distribution and diverse
phenotypes make sedges a compelling system for the
study of global macroevolutionary ecology in plants
(Naczi and Ford 2008).
Modern methods exploring macroevolutionary
dynamics frequently require well-resolved, highly
complete phylogenies (Smith and Donoghue 2008;
Smith et al. 2009; Holton and Pisani 2010), which
can be prohibitively expensive to obtain via directed
sequencing of individual loci. Fortunately, public
resources, such as GenBank contain an abundance of
existing data that can be used for this purpose. Here, we
take advantage of supermatrix data-mining methods
to reconstruct the most inclusive sedge phylogeny
yet published, using nucleotide data gathered from
GenBank via a novel data pipeline that relies on the
software PHLAWD (Smith and Donoghue 2008; Smith
et al. 2009). PHLAWD implements a protocol described
in its documentation as “megaphylogeny”—an iterative
process involving the extraction and alignment of
nucleotide sequences, which results in fully aligned
matrices. PHLAWD relies on the software Muscle
(Edgar 2004) and MAFFT (Katoh et al. 2005) to perform
recursive profile alignments that attempt to maximize
site homology even at deep evolutionary scales. The
pipeline we present in this article further improves the
utility of the resulting alignments through a series of
complementary refinement steps that filter taxa and
genes to improve taxon coverage and branch stability.
Supermatrix methods offer a variety of advantages,
including, perhaps most significantly, the ability to
reconstruct phylogeny at broad scales with minimal
investment in sequencing (McMahon and Sanderson
2006; de Queiroz and Gatesy 2007; Ren et al. 2009;
Davis et al. 2010; van der Linde et al. 2010; Wolsan
and Sato 2010), provided that the requisite sequence
data are already available. These methods often present
their own challenges, however, including issues of
computational power and time efficiency, contentious
questions regarding the effects of very sparse alignments
on branch lengths (Lemmon et al. 2009; Wiens and
Morrill 2011), and concerns about data integrity.
One important challenge is presented by the
detrimental effects that high levels of missing data may
have on tree reconstruction. A major component of this
effect has recently been examined in detail by Steel
and Sanderson (2010) and Sanderson et al. (2010). These
papers defined and explored a property of data matrices
that they named “partial decisiveness” (PDC), which is
a practical measure of the limits of a multilocus data
set to inform phylogenetic inference—PDC describes the
proportion of branches across all possible trees that may
be resolved given the pattern of missing samples. In most
real-world alignments, some branches in the treespace
cannot be resolved because of insufficient taxon overlap;
alignments containing such patterns thus have a PDC
205
[13:29 28/1/2013 Sysbio-sys088.tex]
Page: 205
205–219
206
SYSTEMATIC BIOLOGY
of < 1, and are said to be “indecisive” for those branches
that cannot be reconstructed. Supermatrices are often
characterized by very high levels of missing data, and as
such can typically be expected to have PDC values lower
than more conventional matrices. The PDC metric has
not been broadly applied across a sample of such large
combined matrices, but in practice we have observed
PDC values between 0.6 and 0.7 to be common for
unoptimized matrices compiled from public databases,
which frequently contain > 70% missing data. Although
it is not necessary to have a completely decisive matrix to
reconstruct the true tree (because all possible branches
in the treespace are not present in all trees), maximizing
PDC can nonetheless generally be expected to improve
tree resolution in the majority of real-world scenarios.
A related challenge is presented by so-called “rogue
taxa”—destabilizing tips that move around the tree
enough in replicate searches to cause the collapse of
clades and the erosion of branch support values. The
instability of a given tip may be compounded by multiple
attributes, although perhaps most frequently by (i) data
nonintegrity due to sequence misidentification and (ii)
very weak phylogenetic signal simply due to a dearth
of meaningful phylogenetic information. Identification
and removal of rogue taxa is generally expected to
help improve the resolution of trees containing such
unstable tips (Thorley and Wilkinson 1999; Thomson and
Shaffer 2010), and several methods have been presented
to accomplish this goal (Thorley and Wilkinson 1999;
Thorley and Page 2000; Smith and Dunn 2008). Here,
we present and make use of a statistic modified
from the I score implemented in Mesquite (Maddison
and Maddison 2010), which overcomes some of the
drawbacks of previously available instability metrics.
In this study, we present results from a supermatrix
phylogeny (a PHLAWD megaphylogeny) of Cyperaceae.
In this context, we address issues of data integrity
as they relate to large combined analyses, and we
explore the association between data decisiveness and
tree resolution. We compare the results of our tree
searches to results from a previously published family
level analysis of the Cyperaceae, and we show that
maximizing the number of loci and taxa using the
methods employed here may in fact yield greater power
to resolve phylogenetic trees than more traditional,
focused sequencing approaches that rely on smaller
numbers of better-sampled loci, even with a highly
incomplete alignment.
MATERIALS AND METHODS
Data Collection and Validation
Data were gathered from GenBank release 185 using
the software tool PHLAWD (Smith et al. 2009), which
searches the NCBI database for all nucleotide sequences
for species within a given taxon that match a given
text query. For instance, PHLAWD can be used to
yield an alignment of all NCBI sequences matching the
query string “internal transcribed spacer” within the
[13:29 28/1/2013 Sysbio-sys088.tex]
VOL. 62
NCBI taxon Cyperaceae. Because some nonorthologous
sequences may be returned for any given query,
PHLAWD requires a set of presupplied guide sequences
for each query. Any query results that do not match these
guide sequences by some arbitrary minimum amount
of coverage and identity are excluded. We used cutoff
proportions of 0.3 for coverage and 0.2 for identity for all
searches.
We gathered data on 23 genes for all available species
within the Cyperaceae. Figure 1 presents a heatmap of
the coverage of these data for all recognized genera. We
also ranked genera by overall coverage for these markers,
according to several statistics intended to measure the
amount of information available for each genus (Fig. 1,
bottom 4 rows; see Supplementary material for details;
Data Dryad doi: 10.5061/dryad.6p76c3pb). From these
data, we generated alignment files representing more
than 1500 species from almost all of the family’s genera.
Guide sequences for each gene were chosen arbitrarily
from the set of available GenBank data, in a manner
that sought to maximize their phylogenetic spread. The
resulting alignment files were concatenated on the basis
of species name using the software phyutility (Smith
and Dunn 2008). This approach combines available data
from multiple exemplar specimens of each species to
make phylogeny estimation at this scale possible, and it
has proven effective for estimating phylogeny of large
numbers of species when parallel data from multiple
markers are not available from a single specimen (Jones
et al. 2002; Smith and Donoghue 2008; Smith et al. 2009).
Because data integrity on public databases such as
GenBank is uncertain, it is possible to include poor
quality or incorrectly identified sequences when data
mining these resources. When multiple sequences from
different vouchers are concatenated on the basis of
species identity, the potential exists for sequences from
misidentified species to be concatenated with correctly
identified ones, leading to the creation of so-called
“chimeric” taxa. The tip nodes that are associated with
these mistaken taxa are expected to contain conflicting
phylogenetic signal which, at least in the case of
bootstrapping, can erode support values for otherwise
well-supported clades, because the chimeric tip is
either expected to be placed near each of its different
constituent species in different replicate searches, or to
contain strong enough conflicting signal that it simply
causes parts of the tree to collapse. One approach to a
posteriori identification of these taxa involves scoring
each taxon for a leaf stability index—a statistic that
quantifies how much its position changes relative to
other taxa in replicate tree searches. Those taxa with the
lowest stability are assumed to contain either conflicting
phylogenetic signal (especially in the case of chimeric
taxa) or such low levels of signal that many placements
are equivocal.
Multiple methods of measuring leaf stability have
been proposed (Thorley and Wilkinson 1999; Thorley
and Page 2000; Smith and Dunn 2008). For this study, we
built upon previous work by Maddison and Maddison
(2010) in Mesquite version 2.7. They present a statistic,
Page: 206
205–219
[13:29 28/1/2013 Sysbio-sys088.tex]
1500
1000
500
0
0
value
Color key and histogram
count
0.2
0.4
0.6
0.8
1
TOT_percent_cover
TOT_proportion_spp
ABS_importance
ABS_import_rank
trnl
trnl-trnf
rbcl
ndhf
its
ets
trnk
rpoc1
rpob
psba-trnh
atpf-atph
psbk-psbi
adh1
adh2
atpb-rbcl
accd
atp1
matk
trny-trne
atpb
trnc-ycf6
petn-psbm
rps16
HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX
FIGURE 1. Heat map of nucleotide sequence sampling in Cyperaceae. Sampling density and summary statistics for all currently recognized genera of Cyperaceae are shown for each
of the 23 most densely sampled genetic markers on GenBank release 185. Sampling density for each marker/genus combination is a proportion of the total number of species sampled for
that marker/genus divided by the total number of species recognized in that genus. Summary statistics are used to rank to relative quality of taxonomic sampling: “TOT percent cover” is
the proportion of total species sequences obtained for some genus; “TOT proportion spp” is the relative contribution of a genus to the species richness of the family; “ABS importance” is a
summary statistic that incorporates both TOT percent cover and TOT proportion spp to estimate the value of additional sequencing within some genus. “RANK importance” is the percentile
rank of each value of ABS importance and represents the relative necessity of additional sequencing investment for each genus. See Supplementary material for a more thorough explanation
including formulas used for summary statistics.
Actinoschoenus
Zameioscirpus
Microdracoides.squamosus
Volkiella.disticha
Eleocharis
Androtrichum
Oreobolopsis
Phylloscirpus
Kyllingiella
Courtoisina
Isolepis
Trichophorum
Alinula
Blysmus
Mesomelaena
Chrysitrix
Tetraria
Ficinia
Carex
Eriophorum
Calyptrocarya
Trianoptiles
Cyathochaeta
Costularia
Gahnia
Schoenus
Fimbristylis
Becquerelia
Trilepis
Ascolepis
Trichoschoenus.bosseri
Sumatroscirpus.junghuhnii
Rhynchocladium.steyermarkii
Reedia.spathacea
Pseudoschoenus.inanus
Principina.grandis
Nelmesia.melanostachya
Koyamaea.neblinensis
Everardia
Cypringlea
Cephalocarpus
Afrotrilepis
Bisboeckelera
Lepidosperma
Bulbostylis
Oxycaryum
Rhynchospora
Scleria
Lagenocarpus
Machaerina
Kyllinga
Lipocarpha
Mapania
Hypolytrum
Cyperus
Pycreus
Carpha
Gymnoschoenus
Bolboschoenus
Schoenoplectus
Schoenoplectiella
Chorizandra
Evandra
Morelotia
Scirpus
Tricostularia
Epischoenus
Coleochloa
Paramapania
Fuirena
Diplacrum
Caustis
Scirpodendron
Capitularina.involucrata
Exocarya.sclerioides
Hellmuthia.membranacea
Scirpoides
Remirea.maritima
Sphaerocyperus.erinaceus
Oreobolus
Arthrostylis.aphylla
Amphiscirpus
Actinoscirpus
Dulichium.arundinaceum
Ptilothrix.deusta
Lepironia.articulata
Neesenbeckia.punctoria
Diplasia.karatifolia
Cladium
Trachystylis.stradbrokensis
Capeobolus.brevicaulis
Khaosokia.caricoides
2013
207
Page: 207
205–219
208
SYSTEMATIC BIOLOGY
I, which summarizes taxon movement among a set of
trees, using differences in patristic distance between all
taxon pairs across all pairs of trees in the set. Each
difference in distance for a pair of taxa i and j between
each pair of trees x and y is scaled to the sum of
the patristic distance between i and j across both trees
x and y, and these are summed across all trees. The
Mesquite implementation of this method also requires
an alignment to be loaded because the instability scores
for each taxon are plotted against the percentage of
missing data for that taxon in the alignment, but this
requirement is inconvenient when the comparison of
instability to missing data is not of interest. An additional
drawback is that because the magnitude of the scores
depends on the number of trees and the number of
taxa, direct comparison of I scores generated from
different distributions of trees is not straightforward.
It is also not feasible to perform these analyses in
Mesquite using very large phylogenies because of
memory restrictions. To overcome these challenges, we
wrote a Python (http://www.python.org; last accessed
November 2012) program that can take advantage
of parallel multiprocessing architectures and large
amounts of memory to rapidly calculate instability
scores even for large sets of very large trees. This script
generates scores we have termed I S , which differ from I
in that they are instead scaled to the random expectation
for taxon movement in the provided set of trees. Thus,
for all combinations {x,y} of trees x and y in some set
of trees X (e.g. a set of bootstrap replicates or a Bayesian
posterior distribution), and all taxa j = i in these trees, the
summary statistic IiS for a given taxon i is calculated as:
2
|Dijx −Dijy |
IiS =
{xy} i =j
DRe ·n(n−1)
where Dija is the unweighted patristic distance between
taxa i and j in tree a, and DRe is the random expectation
for this distance, which is calculated by taking the
average per-tree value of the numerator in the above
equation on a subsample of trees from X with their
tips randomized, and n is the number of trees in
X. This statistic has the advantage of being directly
interpretable as a multiple of the random expectation
for taxon movement, and as such it may be compared
directly even between different distributions of trees
with different sets of taxa. A score of IjS = 0.7, for
instance, means that taxon j moves ∼ 70% of the random
expectation for taxon movement relative to all the other
taxa across a given set of trees. A taxon with a score of
1.4 would, therefore, move twice this much.
Data Filtering and Heuristic Searches
We used 2 strategies to subsample the data gathered
by phlawd to improve the resolution of resulting trees.
The first strategy consisted of removing the top 10%
of taxa with the highest I S scores to improve branch
[13:29 28/1/2013 Sysbio-sys088.tex]
VOL. 62
stability, the second of excluding taxa lacking sequences
for both of 2 loci that are broadly sampled and contain
strong phylogenetic signal for deep nodes in the tree;
these are ndhF and rbcL. This second approach, which
we refer to as phylogenetic “scaffolding,” improves the
PDC of the matrix by maximizing taxon coverage for
the selected markers, and relies on the assumption that
those markers contain sufficient signal and sampling
to reconstruct major relationships (i.e. deep nodes) in
the tree. To independently assess levels of phylogenetic
signal across our data set, we estimated phylogenetic
informativeness profiles (Townsend and López-Giráldez
2010; Townsend and Leuenberger 2011) for the most
broadly sampled loci in this study (Fig. 2), using
the PhyDesign website (López-Giráldez and Townsend
2011). Informativeness profiles indicate the predicted
phylogenetic signal contained by each locus as a function
of tree depth, and can be useful for assessing the utility of
loci to reconstruct relationships in different parts of the
tree. Although rbcL has overall lower informativeness
throughout the entire tree than ndhF (Fig. 2), it is the
most broadly sampled marker across the family and we
justified its use as part of the scaffold because it allows
the inclusion of many rare taxa that would otherwise be
absent, with a minimal decrease in decisiveness.
To assess the relative utility of both of our filtering
approaches, we conducted 4 parallel rounds of inference
using data sets filtered by either (i) neither method (i.e.
raw concatenated PHLAWD alignments), (ii) filtering
rogue taxa only, (iii) filtering taxa lacking sequences
for elected loci only, or (iv) both methods (Table 1).
We performed maximum likelihood (ML) bootstraps
to estimate phylogenies from these alignments using
RAxML version 7.2.6 (Stamatakis 2006). For each
alignment, a 300-replicate rapid bootstrap heuristic
search (RAxML’s “-f a” option) was performed. Finally,
we calculated descriptive statistics about the alignments
and the resulting trees to compare our filtering methods’
efficacy for improving tree resolution.
For comparison, we also reconstructed the Cyperaceae
family alignment used in Muasya et al. (2009), which
consisted of rbcL and trnL-trnF data. We applied the
same heuristic search methods to this alignment and
evaluated the resulting trees under the same criteria.
The sequence data were gathered directly from GenBank
release 185 using the accession ids listed in Muasya et
al. (2009). Alignment was performed in Muscle v3.7
(Edgar 2004), with subsequent manual adjustments by
eye. The trnL-trnF region is highly polymorphic at this
phylogenetic depth and some regions were excluded
because of difficulty assessing homology, as in the
original paper. Using these methods, we were able to
collect and align sequence data for 253 of the original
262 taxa.
RESULTS
A heatmap of sampling density for select markers
across all currently recognized Cyperaceae genera is
Page: 208
205–219
2013
209
4.1
2.1
0
ndhf
rbcl
rps16
8.2
6.2
4.1
2.1
0
trnl-trnf
-7
6.2
10
8.2
Net Phy. Inform.
matk
-7
its
10
atpb
Net Phy. Inform.
HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX
8.2
6.2
4.1
2.1
0
0.8
0.6
0.4
Tree depth
0.2
0
0.8
0.6
0.4
Tree depth
0.2
0
0.8
0.6
0.4
Tree depth
0.2
0
FIGURE 2. Phylogenetic informativeness profiles for the most broadly sampled loci in the combined PHLAWD alignment, generated using
PhyDesign. Labeled panels contain the infomativeness profile for the indicated gene, while the bottom-middle panel contains all profiles and the
bottom right panel contains the chronogram used by the informativeness algorithm. Shapes in the tree represent the same clades as in Figure 3.
TABLE 1.
Descriptive statistics for alignments and trees resulting from each of four combined approaches to data filtering, and an
rbcL/trnL-trnF alignment reconstructed from Muasya et al. 2009
Method
Tips
Sites
Missing
Decisiveness
Resolved
> 0.95
> 0.7
> 0.6
0. trnL-F+rbcL tree
1. All tips/rogues not filtered (AT/NF)
2. All tips/rogues filtered (AT/RF)
3. Scaffold taxa only/rogues not filtered (SC/NF)
4. Scaffold taxa only/rogues filtered (SC/RF)
253
1526
1366
484
435
3006
20017
19425
16025
16016
0.46
0.89
0.88
0.79
0.79
0.95
0.74
0.76
0.86
0.86
0.49
0.57
0.64
0.71
0.76
0.31
0.31
0.33
0.41
0.42
0.77
0.72
0.74
0.75
0.76
0.91
0.86
0.87
0.87
0.86
Notes: Row 0 corresponds to this rbcL/trnL-trnF alignment. Rows 1–4 correspond to various filtering methods applied to the PHLAWD alignment:
1. No filtering, all available data for the markers chosen at the data-mining step are used. 2. Filtering of unstable tips only—all sequence data
from the 90% most stable tips are used. 3. Filtering of less informative loci only—taxa not represented by at least an rbcL or ndhF sequence are
removed, all other available markers for the remaining taxa retained. 4. Filtering of loci and taxa—all available loci for the 90% most stable taxa
represented by at least ndhF or rbcL are retained. All decimal values are proportions. The “Tips” and “Sites” columns show the actual count
of tips and sites in the alignments; “Missing” is the proportion of missing data across all cells; “Decisiveness” is the PDC metric; “Resolved”
shows the proportion of resolved nodes in the majority rule consensus trees calculated from the ML bootstrap tree sets, and the last 3 columns
represent the proportion of these resolved nodes bearing a bootstrap support value greater than or equal to the respective cutoff value.
presented in Figure 1. Only a small handful of loci (those
at the top of the matrix) show reasonably broad sampling
across the family. The broadest sampled are ndhF and
rbcL. The second to last row, ABS_importance, contains
values of a summary statistic we developed to quantify
taxon coverage for each genus, for the markers used in
this study (see Supplementary material for formulas).
Because all genera have been poorly sampled for most
of these markers, all their importance scores are quite
similar. We therefore calculated the percentile rank of
each importance score, which are presented in the last
row, ABS_import_rank. High scores in this row indicate
those genera with the lowest levels of sampling. This
[13:29 28/1/2013 Sysbio-sys088.tex]
heatmap was generated using the gplots package in R
(Warnes et al. 2011).
Table 1 presents statistics describing the alignments
and corresponding ML trees constructed from the raw
and filtered phlawd alignments and also the rbcL +
trnL-trnF alignment reconstructed from Muasya et al.
(2009). Despite having very high decisiveness, the
rbcL +trnL-trnF alignment performed similarly to the
concatenated, unfiltered PHLAWD alignment, and was
actually surpassed by the raw PHLAWD data in terms
of total proportion of nodes resolved, by ∼ 10%. For the
filtered PHLAWD alignments, each increase in filtering
stringency was associated with further improvement to
Page: 209
205–219
210
SYSTEMATIC BIOLOGY
the resolution of the resulting consensus trees (Fig. 3), as
evidenced by the increase in the proportion of resolved
nodes (Table 1). Although the frequency of nodes
with > 0.6 bootstrap proportion (BP) varied little as filter
strictness increased, the proportion of nodes with 0.7 or
greater BP showed a steady increase. The most dramatic
improvement was seen when taxa not represented
by either an rbcL or ndhF sequence were excluded
from the alignment—the proportion of nodes resolved
with > 0.95 BP increased by 8%, with a concomitant and
presumably related 10% increase in the decisiveness of
the matrix. The best-resolved and best-supported tree
(Figs. 4–7) was generated from the most strictly filtered
alignment (Table 1, row 4), which contained only the
90% most stable tips from the alignment consisting only
of those taxa represented by a sequence for at least one
of the loci ndhF and rbcL.
The majority-rule consensus topologies from the ML
bootstrap searches corresponding to rows 1–4 of Table 1
are presented in Figure 3. The trees are presented
without taxon names or support values because space
constraints preclude legible presentation of these data
in printed form (higher-resolution versions of trees
1–3 with legible taxon names and support values are
available in Supplementary material, respectively; Data
Dryad doi: 10.5061/dryad.6p76c3pb). Figure 3 clearly
shows an overall increasing trend in the proportion
of resolved nodes that accompanied more stringent
filtering methods. The best-resolved consensus tree,
corresponding to the most heavily filtered alignment
(Table 1, row 4), is presented in Figures 4–7. Support
values throughout most of the tree are high, especially
for deep nodes, and relationships among major lineages
of sedges are generally well-resolved, with strong
support for the monophyly of the 2 subfamilies
Mapanioideae and Cyperoideae, and generally strong
support for most previously recognized major clades.
The monophyly of many currently recognized genera,
however, is not supported, in agreement with previous
findings from the studies that generated the data we
gathered from GenBank.
DISCUSSION
Data Decisiveness and Tree Reconstruction
PDC (Sanderson et al. 2010; Steel and Sanderson
2010) provides a general method of assessing the effect
of missing data on tree reconstruction methods, and
it is of interest for supermatrix data mining because
maximizing PDC is likely to improve phylogenetic
resolution. This metric is independent from measures of
phylogenetic signal; instead it describes the proportion
of branches, out of the set of all branches in all possible
trees, which may be reconstructed given the patterns of
taxon overlap among markers in the data set. Sampling
matrices with high decisiveness may be used to
reconstruct most or all branches in many trees, whereas
those with low decisiveness may lack the taxon overlap
[13:29 28/1/2013 Sysbio-sys088.tex]
VOL. 62
to even allow resolution of many or all branches, even if
the sampled markers have strong phylogenetic signal.
Because missing data can have such strong effects
on tree searching, decisiveness should be of interest
to all systematists working with even moderately
incomplete data sets, and particularly so for those
using very large, sparse data sets. However, achieving
complete decisiveness is not necessarily a requirement
for resolving the best tree (or set of trees), because under
the common assumption that there are relatively few
best trees (or just one), only a tiny fraction of all possible
branches is represented in them (at least for large trees).
As long as patterns of missing data do not preclude the
resolution of these “good” branches, very good trees may
be found using matrices with PDC considerably < 1, as
we have demonstrated here.
The PDC metric itself actually measures the
proportion of 4-way combinations (quartets) of taxa
that are represented by homologous data for at least a
single locus in the alignment. Matrices within which
every quartet is represented by a set of homologous
sequences for at least one locus are fully decisive. Steel
and Sanderson (2010) called this the 4-way partition
property. The phylogenetic scaffolding approach we
used relies on this property: it maximizes the proportion
of homologous quartets/triplets by excluding taxa not
represented for some subset of loci—the scaffold
backbone. When a single locus is used as a backbone,
scaffolding results in a completely decisive matrix,
because the 4-way partition property is met through
the presence of all taxa at that locus. We used 2 wellsampled—but not completely sampled—loci to form
a backbone, which does not necessarily result in a
completely decisive matrix, but has the advantage of
allowing more tips to be included.
Phylogenetic Informativeness and Scaffolding
An additional advantage of phylogenetic scaffolding
is that it provides a means to robustly reconstruct
deep nodes in the tree (using the backbone loci), while
allowing the inclusion of faster evolving markers to
inform shallower relationships, even when sampling
does not overlap among clades. Identifying good
backbone markers requires consideration not only of
how broadly sampled the candidate markers are but
also the depths at which they will be maximally
informative. One metric that may be helpful for selecting
scaffold backbones is the phylogenetic informativeness
of Townsend and Leuenberger (2011) and Townsend and
López-Giráldez (2010), which quantifies the predicted
phylogenetic signal for a given marker as a function of
tree depth.
As shown in Figure 2, some loci that are broadly
sampled, such as ITS and trnL-trnF, are characterized
by having informativeness peaks at relatively shallow
depths. The peak informativness is presumably the point
at which the sequence data reach saturation in terms
of number of changes, and beyond which their signal
Page: 210
205–219
2013
HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX
a)
b)
c)
d)
211
FIGURE 3. Majority-rule consensus trees resulting from parallel 300-replicate ML bootstrap searches performed on each of the alignments
summarized in Table 1. a)–d) correspond to rows 1–4 from Table 1, respectively. Although tip and branch labels could not be included due to
space constraints, several topological landmark nodes corresponding to several major clades are labeled. The star labels the genus Carex, the
square Eleocharis, the circle the tribe Cypereae, and the triangle the tribe Schoeneae. The dotted lines in c) and d) indicate an area of the tree
that apparently experienced a decrease in resolution as a result of more stringent rogue taxon filtering, although as discussed in the text, this
apparent loss of resolution may actually represent an increase in phylogenetic accuracy.
is replaced by noise. Markers showing this pattern are
presumed to be more informative for shallow nodes
than deep ones. Loci exhibiting this pattern are of
suspect utility for use as a scaffold backbone, but may
be included in alignments containing strong signal in
other loci for deep nodes. In the context of this study,
loci such as rbcL and ndhF have lower predicted overall
signal than some more rapidly evolving markers, but
because they remain unsaturated deep in the tree, they
are presumed to contain more information to resolve
deep nodes (Fig. 2). Unsaturated loci such as these
are better candidates for use as backbones. We have
observed, however, that high predicted signal does not
necessarily equate with high resolution even in clades
[13:29 28/1/2013 Sysbio-sys088.tex]
for which a locus is well-sampled (e.g. trnL-trnF for
the genus Carex). The relatively low performance of the
rbcL+trnL-trnF alignment (Table 1, row 1), despite the
exceptional predicted signal for trnL-trnF, provides an
example of this (Fig. 2).
Rogue Taxon Filtering
Filtering out the 10% least stable taxa from the mostinclusive alignment (Table 1, row 2) resulted in a 7%
increase in the proportion of resolved nodes, whereas
filtering the 10% least stable taxa from the scaffolded
alignment resulted in a 5% increase. These differences
Page: 211
205–219
212
SYSTEMATIC BIOLOGY
VOL. 62
FIGURE 4. Majority rule consensus tree resulting from a 300-replicate ML search performed on the alignment corresponding to Table 1, row
4. Branch labels indicate BP. Major clades and grades of the Cyperaceae are labeled to the right. D. = Dulichieae; Fuir. grade = Fuireneae grade;
T. = Trilepideae. Each labeled group occurs only once in the tree, but it may extend across multiple pages. In these cases, the parts of the group
on different pages are labeled individually.
[13:29 28/1/2013 Sysbio-sys088.tex]
Page: 212
205–219
2013
213
HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX
To Fig. 6
71.67
76.67
69.67
95
60.67
77.67
50
99.67
99.67
98
56.33
99.33
98.67 93.67
95.67
82
97
100
99
91.67
98.67
98.67
80
100
70
100
91.33
71.33
93.33
94
100
84
75.67
61.67
64
98
75.67
63
62
100
88.67
78.33
59.33
76.67
83.67
96.67
72
99.67
87
52.33
67.67
97
51.33
83.33
53.67
FIGURE 5.
98.33
59
D.
To Fig. 4
58.33
Scirpeae
95
94.33
Carex_pairae
Carex_divulsa
Carex_otrubae
Carex_lamprocarpa
Carex_echinata
Carex_lachenalii
Carex_canescens
Carex_marina
Carex_ovalis
Carex_xerantica
Carex_chordorrhiza
Carex_cephalophora
Carex_conferta
Carex_maritima
Uncinia_hamata
Uncinia_phleoides
Uncinia_nemoralis
Uncinia_uncinata
Uncinia_filiformis
Carex_obtusata
Carex_rupestris
Carex_monostachya
Carex_elynoides
Carex_nardina
Carex_capillacea
Carex_leptalea
Kobresia_prattii
Kobresia_capillifolia
Carex_backii
Cymophyllus_fraserianus
Kobresia_simpliciuscula
Carex_pauciflora
Kobresia_fragilis
Kobresia_laxa
Schoenoxiphium_filiforme
Schoenoxiphium_burkei
Schoenoxiphium_sparteum
Schoenoxiphium_ecklonii
Schoenoxiphium_ludwigii
Carex_camptoglochin
Carex_microglochin
Carex_peregrina
Carex_pulicaris
Eriophorum_brachyantherum
Eriophorum_chamissonis
Eriophorum_angustifolium
Eriophorum_vaginatum
Scirpus_flaccidifolius
Scirpus_hattorianus
Scirpus_georgianus
Scirpus_ancistrochaetus
Scirpus_polystachyus
Scirpus_karuizawensis
Scirpus_mitsukurianus
Scirpus_wichurae
Scirpus_fontinalis
Scirpus_pendulus
Scirpus_atrocinctus
Scirpus_cyperinus
Scirpus_microcarpus
Scirpus_sylvaticus
Scirpus_expansus
Scirpus_radicans
Scirpus_orientalis
Phylloscirpus_acaulis
Phylloscirpus_deserticola
Phylloscirpus_bolivianus
Amphiscirpus_nevadensis
Trichophorum_alpinum
Trichophorum_pumilum
Trichophorum_planifolium
Trichophorum_clintonii
Oreobolopsis_clementis
Trichophorum_dioicum
Oreobolopsis_tepalifera
Oreobolopsis_inversa
Trichophorum_rigidum
Trichophorum_subcapitatum
Zameioscirpus_muticus
Zameioscirpus_atacamensis
Trichophorum_cespitosum
Eriophorum_crinigerum
Khaosokia_caricoides
Dulichium_arundinaceum
Blysmus_compressus
Carex
73.33
See legend to Figure 4.
may seem small, but visual inspection of the resulting
consensus trees (Fig. 3, compare 3a vs. 3b and 3c vs. 3d)
confirms that they represent a meaningful increase in
resolution. The newly resolved nodes are concentrated in
areas of the tree that were mostly to entirely unresolved
by the unfiltered alignments. Some areas of the tree
[13:29 28/1/2013 Sysbio-sys088.tex]
actually experienced an apparent decrease in resolution
as a result of stringent rogue filtering (see dotted
outlines in Fig. 3d), but the nodes that were lost were
characterized by relatively low support values, and with
their removal, support values for surrounding nodes
increased.
Page: 213
205–219
214
FIGURE 6.
SYSTEMATIC BIOLOGY
VOL. 62
See legend to Figure 4.
[13:29 28/1/2013 Sysbio-sys088.tex]
Page: 214
205–219
2013
FIGURE 7.
HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX
215
See legend to Figure 4.
[13:29 28/1/2013 Sysbio-sys088.tex]
Page: 215
205–219
216
SYSTEMATIC BIOLOGY
Several leaf instability indices exist, all of which can
be used to filter rogues (Thorley and Wilkinson 1999;
Thorley and Page 2000; Smith and Dunn 2008; Maddison
and Maddison 2010). They serve similar purposes
and should have essentially similar results, but their
interpretations may differ. The statistic developed by
Thorley and Wilkinson (1999), implemented in RadCon
(Thorley and Page 2000) and phyutility (Smith and Dunn
2008), is based on branch support values and may be
sensitive to them, while the I statistic employed by
Maddison and Maddison (2010) in Mesquite measures
taxon movement directly (as opposed to using support
values), but is not standardized among data sets. The
statistic I S that we employ also measures movement
directly, but creates scores that may be directly compared
among data sets. The python script implementing the
I S calculations may also be applied to larger sets of
larger trees than some of the other methods, using
parallel computation. How the I S method compares to
the graph-based approach RogueNaRok (Aberer et al.
2012) remains to be seen.
Data Filtering and Sampling Strategies
Although filtering taxa out of alignments reduces their
taxonomic inclusivity, it is a tradeoff that may be useful
for developing highly resolved trees (Kearney 2002;
Wiens 2003b; Campbell and Lapointe 2009; Thomson
and Shaffer 2010; Aberer et al. 2012). In the case
of supermatrices, even heavily filtered data sets are
often considerably more taxon inclusive than manually
generated ones. In this study, our best-resolved tree
has 435 tips—only 29% of the available Cyperaceae
operational taxonomic units (OTU) on GenBank—
but this still represents an addition of 173 OTUs (a
66% increase) when compared with the next largest
published phylogeny for the group (Muasya et al. 2009).
In addition, our trees present stronger support for
deep relationships than any previous family level tree
(Katsuyama et al. 2007; Muasya et al. 2009; Hinchliff et al.
2010), which is likely directly due to the greater number
and variety of loci used to generate them.
The sampling patterns shown in Figure 1 reflect a
composite sampling strategy that has been applied to
many organismal groups, because most phylogenetic
studies have traditionally fallen into one of 2 sampling
categories (i) many studies focus narrowly on a single
(often large, species rich) clade, which is sampled
extensively for a small number of rapidly evolving
markers that may not be shared by other studies,
while (ii) some studies take a broader sample across a
deeper taxonomic scale of one or more relatively slowly
evolving markers that are often not shared with the finerscale studies. High-throughput sequencing is likely to
influence these strategies, but there will still exist slowly
evolving markers that are broadly represented, and fast
markers specific to certain groups.
[13:29 28/1/2013 Sysbio-sys088.tex]
VOL. 62
In the case of Cyperaceae, ndhF and rbcL have been
sampled broadly by family level studies, whereas ITS,
ETS and a variety of chloroplast markers have been
used by infrageneric studies. Because of low levels
of overlap among studies, Figure 1 contains a high
proportion of blue cells representing absent data for
most locus-taxon combinations. This apparent lack of
sampling, however, is not necessarily problematic for
phylogeny reconstruction because the nature of the
phylogenetic sampling is geared specifically to address
the relationships in the most reasonable trees, though
careful selection of markers and taxa by the authors of
phylogenetic studies.
While careful addition of sequence data to sparse
alignments can dramatically improve results (van der
Linde et al. 2010; Brown J.W., unpublished data), it is
typically redundant to require complete coverage for
all markers, and prohibitively so for supermatrices. The
power of these methods is their ability to tie together
disparate but already information-rich data sets, and
the generation of full alignments is not required for
success (Wiens 2003a). This is simply because, given the
sampling patterns inherent in phylogenetic studies, only
a small subset of additional sequence data are likely
to contain additional phylogenetic information—it is
not advisable to exhaustively sample slowly evolving
markers for closely related taxa, because they will
contain little variation, nor does it makes sense to
sequence rapidly evolving markers for distantly related
taxa, because they will contain a great deal of noise.
It is worth pointing out, however, that despite our
abilities to accurately reconstruct relationships despite
high levels of missing data, sparse alignments can
(and do) negatively impact our ability to infer branch
lengths (Lemmon et al. 2009). Some disagreement exists
regarding the severity of this problem (see Wiens and
Morrill 2011), and the topic of sampling optimization
remains of interest (Yan et al. 2005; Mittelbach et al. 2007;
Weir and Schluter 2007; Svenning et al. 2008; Sanderson
et al. 2010; Thomson and Shaffer 2010; Townsend and
López-Giráldez 2010; Cho et al. 2011).
Implications for Large Combined Analyses
Results from our large-tree analyses are promising.
The best resolved trees from this study improve upon
less inclusive family level trees (Katsuyama et al. 2007;
Muasya et al. 2009; Hinchliff et al. 2010), suggesting an
important role for effective methods to deal with large,
sparse, and potentially noisy data sets As advances in
sequencing technology drive the generation of more
large, noisy data sets, these methods will become
increasingly more important. Harnessing the power of
these data sets requires an initial planning investment
to ensure data integrity and to optimize phylogenetic
resolution, but the result of successful planning and
data management is very large, very well-resolved
phylogenies, that in many cases may surpass previous
expectations about the limits of possibility.
Page: 216
205–219
2013
HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX
Although supermatrix methods may offer an
alternative to the generation of novel sequence data,
it is important to keep in mind that these approaches
are complementary. Data mining methods rely on
the adequacy of previous sampling, and many taxa
and genomes remain absent from public databases
(Fig. 1). Wet laboratory techniques fulfill the vital role
of improving the coverage of these collaborative data
sets, and the generation of novel sequence data is likely
to remain essential to progress in systematic biology for
the foreseeable future.
Systematics and Classification of Cyperaceae
Our results corroborate many previous findings
regarding sedge phylogeny, but provide new results
that contradict some earlier studies. One of the largest
differences between this and the next most-inclusive
sedge phylogeny in the literature (Muasya et al. 2009)
is that the tree we present has very strong support
for nearly all relationships among major clades. Several
differences in topology also exist. Classification units
referred to herein follow Simpson et al. (2007) and
Goetghebeur (1998). For a thorough, concise discussion
of classification, refer to Muasya et al. (2009).
In our best-resolved sedge phylogeny (Figs. 4–7),
subfamily Mapanioideae is strongly supported as sister
to the rest of the Cyperaceae, with the Trilepideae sister
to the all remaining, as in Muasya et al. (2009) (Fig. 4).
However, we find a Sclerieae+Bisboeckelereae clade
strongly supported as sister to all remaining Cyperaceae,
which is incongruous with Muasya et al. (2009)
where these tribes were nested within the Schoeneae.
A monophyletic Sclerieae+Bisboeckelereae that is
separate from Schoeneae better fits expectations based
on morphological classifications (Goetghebeur 1998)
and some previous molecular phylogenetic analyses
(Simpson et al. 2007). Like Muasya et al. (2009), however,
we resolve Didymiandrum and Lagenocarpus, members of
tribe Cryptangieae that have previously been classified
with Scleria (Goetghebeur 1998), within the Schoeneae
with very high support (Fig. 4). We find support for
this Schoeneae+Cryptangieae clade as sister to all
other remaining lineages. Although relationships within
the Schoeneae are generally unresolved, several wellsupported clades corresponding to major genera are
present. Some genera in Schoeneae appear polyphyletic,
such as Tetraria (corroborated by Verboom 2006) and
Gahnia (though this finding is incongruent with Zhang et
al. 2004). There is good support for a sister relationship
between Cladium and the rest of the tribe (Fig. 4). The
genus Rhynchospora is strongly supported as sister to all
remaining lineages and contains all present members of
the genus Pleurostachys (Fig. 4).
Strong support exists for one large additional
major clade consisting of 2 sister subclades:
(i) Cariceae+Dulichieae+Khaosokia+Scirpeae; and
(ii) Abildgaardieae+Cypereae+Eleocharis+Fuireneae
(Figs. 5–7). Within the first of these, Dulichieae
[13:29 28/1/2013 Sysbio-sys088.tex]
217
is supported in a sister relationship with a
Cariceae+Khaosokia+Scirpeae clade, but relationships
between Cariceae, Scirpeae, and the monotypic
Khaosokia caricoides are unresolved. The monophyly of
the genus Carex, if circumscribed to include Cymophyllus,
Kobresia, Schoenoxiphium, and Uncinia (i.e. the Cariceae),
is strongly supported (Figs. 5 and 6). Most species of
Scirpeae present in this data set are not represented by
sequences at rapidly evolving, data rich markers such
as ndhF and ITS, and the addition of these markers
may help clarify the relationships within this group
and among Cariceae, Khaosokia, and Scirpeae. The
monophyly of a clade containing the Abildgaardieae
and Eleocharis is well supported, as is the monophyly
of both Eleocharis and the Abildgaardieae themselves
(Fig. 6). Relationships among the genera Bolboschoenus,
Fuirena, Schoenoplectiella, and Schoenoplectus are poorly
resolved (Figs. 6 and 7), and additional studies involving
novel sequencing effort will likely be necessary to shed
light on this problematic area. Within the Cypereae,
a strongly supported sister relationship is indicated
between (i) a Ficinia + Isolepis clade in which these 2
genera form nearly monophyletic groups, and (ii) a
clade in which Scirpoides is sister to a clade containing
a highly paraphyletic Cyperus with numerous other
genera interdigitated throughout it (Fig. 7).
Relationships within major clades are in general less
well-resolved than those in the studies from which these
data were originally published (Roalson and Friar 2000;
Yen and Olmstead 2000; Muasya et al. 2001; Roalson
et al. 2001; Muasya et al. 2002; Yano et al. 2004; Zhang
et al. 2004; Chacón et al. 2006; Verboom 2006; Ghamkhar
et al. 2007; Katsuyama et al. 2007; Simpson et al. 2007;
Hinchliff and Roalson 2009; Muasya et al. 2009; Starr
and Ford 2009; Thomas et al. 2009; Waterway et al. 2009;
Hinchliff et al. 2010; Roalson et al. 2010). This may be due
to decreased levels of decisiveness for shallow branches
because of the addition of samples (such as outgroup
taxa) that bridge taxonomic gaps among studies but
lack overlap for sampled genes. Low resolution near the
tips may also result from lower amounts of phylogenetic
information available to inform shallow splits, because
of the smaller numbers of taxa (and therefore fewer
sampled loci) that can be used.
We find strong support for recognizing the following
clades (disregarding the taxonomic rank at which they
are recognized): Mapanioideae, Trilepideae, Sclerieae
(including Bisboeckelereae), Schoeneae (including
Cryptangieae),
Rhynchosporeae,
Abildgaardieae
(including Arthrostylideae), and Eleocharideae
(Figs. 4–7). The Scirpeae+Cariceae clade (Figs. 5 and 6),
and the Fuireneae+Cypereae clade (Figs. 6 and 7), each
contain poorly resolved regions, which may contain
clades that conflict with current classification units. The
key question regarding the Scirpeae+Cariceae clade is
whether the Scirpeae is monophyletic or forms a grade
leading to Cariceae; both of these patterns have been
resolved in published studies (grade: Hinchliff et al.
2010; clade, including Dulichieae: Muasya et al. 2009),
although never with strong support. The phylogeny we
Page: 217
205–219
218
present suggests the presence of 4 lineages, if Eriophorum
crinigerum groups with the rest of Scirpeae: the
Dulichieae (Dulichium + Blysmus), Khaosokia, Scirpeae,
and Cariceae. More targeted work on the Scirpeae will
be necessary to clarify this. The Fuireneae+Cypereae
clade presents a similar problem: the monophyletic
Cypereae contains 2 well-supported clades (Cyperus
s.l. and Ficinia/Isolepis), but the taxa usually attributed
to the Fuireneae form a polytomy below Cypereae
(Figs. 6 and 7). Previous studies have seen these lineages
positioned in many different locations, usually without
strong support (Simpson et al. 2007; Muasya et al. 2009),
but a study using ndhF and psbB-psbH (Hinchliff et
al. 2010) showed strong support for a Fuireneae grade
leading to the Cypereae. Additional sampling of ndhF
and other data-rich cpDNA regions such as psbB-psbH
and perhaps matK may help clarify these relationships.
Overall, 9 clades are strongly supported and
morphologically
diagnosible
(Mapanioideae,
Trilepideae, Sclerieae, Schoeneae, Rhynchosporeae,
Abildgaardieae, Eleocharis, and Cypereae), and should
be recognized in a new classification, as previous
classifications are clearly do not define phylogenetic
lineages as we now know them. Additional research
will clarify how many diagnosible lineages will need to
be recognized within the Carex + Dulichieae + Khaosokia
+ Scirpeae clade and the Fuireneae assemblage.
CONCLUDING REMARKS
We have shown that highly incomplete alignments can
produce well-resolved, highly taxon-inclusive trees, and
that supermatrix data-mining methods that yield these
are feasible to employ. The utility of highly incomplete
alignments may be negatively impacted by missing data,
but the PDC metric of Sanderson et al. (2010) provides
a powerful way of assessing and extending these limits
to inference. One such improvement method is matrix
scaffolding: the removal of tips lacking data for one or
more backbone loci known to contain strong signal for
important nodes deep in the tree. We have shown that
this method can dramatically improve both the PDC of
a matrix and the resolution of the resulting trees. We
also present a novel statistic for identifying tips with
weak or conflicting signal, I S , based on previous work
by Maddison and Maddison (2010), but which can be
more easily interpreted across varied data sets.
The trees we present represent the largest, most
well-resolved sedge phylogenies published to date, and
have direct implications for sedge systematics and
classification. This study is exemplary of a recent but
rapidly developing trend in systematics, where large,
aggregate data sets gathered by automated tools can
be used to yield answers to longstanding questions in
biology at scales once thought impossible. As these
methods continue to mature, we can expect to see
greater data integration, more powerful tree-searching
and statistical tools, and many new opportunities for the
study of organic evolution on Earth.
[13:29 28/1/2013 Sysbio-sys088.tex]
VOL. 62
SYSTEMATIC BIOLOGY
SUPPLEMENTARY MATERIAL
Data files and/or other supplementary information
related to this paper have been deposited on Dryad
at http://datadryad.org under doi: 10.5061/dryad.
6p76c3pb.
FUNDING
This work was supported by the National Science
Foundation [DEB 1011206 to C.E.H.].
ACKNOWLEDGEMENTS
Useful discussions from WSU/UI PuRGe were
invaluable, as was specific feedback from Matt Pennell,
Luke Harmon, and Jeremiah Busch. Many thanks to
Steve Orzell and Edwin Bridges of Avon Park, FL,
for companionship and hospitality in the field that
mistakenly went unobserved in an earlier paper.
REFERENCES
Aberer A.J., Krompass D., Stamatakis A. 2012. Pruning rogue
taxa improves phylogenetic accuracy: an efficient algorithm and
webservice. Systematic Biology (in press) doi:10.1093/sysbio/
sys078.
Bruhl J.J. 1995. Sedge genera of the world: relationships and a new
classification of the Cyperaceae. Aust. Syst. Bot. 8:125–305.
Campbell V., Lapointe F.J. 2009. The use and validity of composite taxa
in phylogenetic analysis. Syst. Biol. 58(6):560–572.
Chacón J., Madriñán S., Chase M.W., Bruhl J.J. 2006. Molecular
phylogenetics of Oreobolus (Cyperaceae) and the origin and
diversification of the American species. Taxon 55(2):359–366.
Cho S., Zwick A., Regier J.C., Mitter C., Cummings M.P., Yao J.,
Du Z., Zhao H., Kawahara A.Y., Weller S., Davis D.R., Baixeras
J., Brown J.W., Parr C. 2011. Can deliberately incomplete gene
sample augmentation improve a phylogeny estimate for the
advanced moths and butterflies (Hexapoda: Lepidoptera)? Syst.
Biol. 60(6):782–796.
de Queiroz A., Gatesy J. 2007. The supermatrix approach to systematics.
Trends Ecol. Evol. 22(1):34–41.
Dahlgren R.M.T., Clifford H.T., Yeo P.F. 1985. The families of
monocotyledons: structure, evolution, and taxonomy. Berlin:
Springer-Verlag.
Davis B.W., Li G., Murphy W.J. 2010. Supermatrix and species tree
methods resolve phylogenetic relationships within the big cats,
Panthera (Carnivora: Felidae). Mol. Phylogenet. Evol. 56(1):64–76.
Edgar R.C. 2004. MUSCLE: multiple sequence alignment with high
accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.
Ghamkhar K., Marchant A.D., Wilson K.L., Bruhl J.J. 2007. Phylogeny
of Abildgaardieae (Cyperaceae) inferred from ITS and trnL-F data.
Aliso 23:149–164.
Goetghebeur P. 1998. Cyperaceae. In: Kubitzki K., editor. The families
and genera of vascular plants. Berlin: Springer. p. 164.
Govaerts R., Simpson D.A., Bruhl J.J., Egorova T.V., Goetghebeur P.,
Wilson K.L. 2007. World checklist of Cyperaceae. London: Royal
Botanic Gardens Kew.
Hinchliff C.E., Lliully Aguilar A., Carey T., Roalson E.H. 2010. The
origins of Eleocharis (Cyperaceae) and the status of Websteria,
Egleria, and Chillania. Taxon 59(3):709–719.
Hinchliff C.E., Roalson E.H. 2009. Stem architecture in Eleocharis
subgenus Limnochloa (Cyperaceae): evidence of dynamic
morphological evolution in a group of pantropical sedges.
Am. J. Bot. 96(8):1487–1499.
Holton T.A., Pisani D. 2010. Deep genomic-scale analyses of the
metazoa reject Coelomata: evidence from single- and multigene
Page: 218
205–219
2013
HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX
families analyzed under a supertree and supermatrix paradigm.
Genome Biol. Evol. 2:310–324.
Jones K., Purvis A., MacLarnon A., Bininda-Emonds O., Simmons N.
2002. A phylogenetic supertree of the bats (Mammalia: Chiroptera).
Biol. Rev. 77:223–259.
Katoh K., Kuma K., Toh H., Miyata T. 2005. MAFFT version 5:
improvement in accuracy of multiple sequence alignment. Nucleic
Acids Res. 33:511–518.
Katsuyama T., Hirahara T., Hoshino T. 2007. Suprageneric phylogeny
of Japanese Cyperaceae based on DNA sequences from chloroplast
ndhF and 5.8 S nuclear ribosomal DNA. Acta Phytotaxon. Geobot.
58(2):57.
Kearney M. 2002. Fragmentary taxa, missing data, and ambiguity:
mistaken assumptions and conclusions. Syst. Biol. 51(2):369–381.
Lemmon A.R., Brown J.M., Stanger-Hall K., Lemmon E.M. 2009.
The effect of ambiguous data on phylogenetic estimates obtained
by maximum likelihood and Bayesian inference. Syst. Biol. 58(1):
130–145.
López-Giráldez F., Townsend J.P. 2011. PhyDesign: an online
application for profiling phylogenetic informativeness. BMC Evol.
Biol. 11(1):152.
Maddison W.P., Maddison D.R. 2010. Mesquite: a modular system
for evolutionary analysis. Vancouver, BC: University of British
Columbia.
McMahon M., Sanderson M.J. 2006. Phylogenetic supermatrix analysis
of GenBank sequences from 2228 papilionoid legumes. Syst. Biol.
55(5):818.
Mittelbach G., Schemske D.W., Cornell H.V., Allen A.P., Brown J.M.,
Bush M.B., Harrison S.P., Hurlbert A., Knowlton N., Lessios H.
2007. Evolution and the latitudinal diversity gradient: speciation,
extinction and biogeography. Ecol. Lett. 10(4):315–331.
Muasya A.M., Simpson D.A., Chase M.W. 2002. Phylogenetic
relationships in Cyperus L. sl (Cyperaceae) inferred from plastid
DNA sequence data. Bot. J. Linn. Soc. 138(2):145–153.
Muasya A.M., Simpson D.A., Chase M.W., Culham A. 2001. A
phylogeny of Isolepis (Cyperaceae) inferred using plastid rbcL and
trnL-F sequence data. Syst. Bot. 26(2):342–353.
Muasya A.M., Simpson D.A., Verboom G.A., Goetghebeur P., Naczi
R., Chase M.W., Smets E. 2009. Phylogeny of cyperaceae based on
DNA sequence data: current progress and future prospects. Bot.
Rev. 75(1):2–21.
Naczi R.F., Ford B.A. 2008. Sedges: uses, diversity, and systematics of
the Cyperaceae. Saint Louis (MO): Missouri Botanic Garden Press.
Ren F., Tanaka H., Yang Z. 2009. A likelihood look at the supermatrixsupertree controversy. Gene 441:119–125.
Roalson E.H., Columbus J.T., Friar E.A. 2001. Phylogenetic relationships
in Cariceae (Cyperaceae) based on ITS (nrDNA) and trnT-LF
(cpDNA) region sequences: assessment of subgeneric and sectional
relationships in Carex with emphasis on section Acrocystis. Syst.
Bot. 26:318–341.
Roalson E.H., Friar E.A. 2000. Infrageneric classification of Eleocharis
(Cyperaceae) revisited: evidence from the internal transcribed
spacer (ITS) region of nuclear ribosomal DNA. Syst. Bot. 25(2):
323–336.
Roalson E.H., Hinchliff C.E., Trevisan R., Da Silva C. 2010. Phylogenetic
relationships in Eleocharis (Cyperaceae): C4 photosynthesis origins
and patterns of diversification in the Spikerushes. Syst. Bot.
35(2):257–271.
Sanderson M.J., McMahon M., Steel M. 2010. Phylogenomics with
incomplete taxon coverage: the limits to inference. BMC Evol. Biol.
10(1):155.
Simpson D.A., Muasya A.M., Alves M., Bruhl J.J., Dhooge S., Chase
M.W., Furness C.A., Ghamkhar K., Goetghebeur P., Hodkinson T.,
Marchant A.D., Reznicek A.A., Nieuwborg R., Roalson E.H., Smets
E., Starr J.R., Thomas W.W., Wilson K.L., Zhang X. 2007. Phylogeny
of Cyperaceae based on DNA sequence data—a new rbcL analysis.
Aliso 23:72–83.
Smith S.A., Beaulieu J.M., Donoghue M.J. 2009. Mega-phylogeny
approach for comparative biology: an alternative to supertree and
supermatrix approaches. BMC Evol. Biol. 9:37.
Smith S.A., Donoghue M.J. 2008. Rates of molecular evolution
are linked to life history in flowering plants. Science 322(5898):
86–89.
[13:29 28/1/2013 Sysbio-sys088.tex]
219
Smith S.A., Dunn C.W. 2008. Phyutility: a phyloinformatics tool for
trees, alignments and molecular data. Bioinformatics 24(5):715–716.
Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based
phylogenetic analyses with thousands of taxa and mixed models.
Bioinformatics 22:2688–2690.
Starr J.R., Ford B.A. 2009. Phylogeny and evolution in Cariceae
(Cyperaceae): current knowledge and future directions. Bot. Rev.
75(1):110–137.
Steel M., Sanderson M.J. 2010. Characterizing phylogenetically decisive
taxon coverage. Appl. Math. Lett. 23(1):82–86.
Stevens P.F. 2008. Angiosperm Phylogeny Website. Available from:
URL http://www.mobot.org/mobot/research/APweb/. Last
accessed November 2012.
Svenning J., Borchsenius F., Bjorholm S., Balslev H. 2008. High tropical
net diversification drives the New World latitudinal gradient in
palm (Arecaceae) species richness. J. Biogeogr. 35(3):394–406.
Thomas W.W., Araújo A.C., Alves M. 2009. A preliminary molecular
phylogeny of the Rhynchosporeae (Cyperaceae). Bot. Rev. 75(1):
22–29.
Thomson R.C., Shaffer H.B. 2010. Sparse supermatrices for
phylogenetic inference: taxonomy, alignment, rogue taxa, and the
phylogeny of living turtles. Syst. Biol. 59(1):42–58.
Thorley J.L., Page R.D.M. 2000. RadCon: phylogenetic tree comparison
and consensus. Bioinformatics 16:486–487.
Thorley J.L., Wilkinson M. 1999. Testing the phylogenetic stability of
early tetrapods. J. Theor. Biol. 200(3):343.
Townsend J.P., Leuenberger C. 2011. Taxon sampling and the optimal
rates of evolution for phylogenetic inference. Syst. Biol. 60(3):
358–365.
Townsend J.P., López-Giráldez F. 2010. Optimal selection of gene and
ingroup taxon sampling for resolving phylogenetic relationships.
Syst. Biol. 59(4):446–457.
van der Linde K., Houle D., Spicer G., Steppan S. 2010. A supermatrixbased molecular phylogeny of the family Drosophilidae. Genet. Res.
92(1):25–38.
Verboom G.A. 2006. A phylogeny of the schoenoid sedges (Cyperaceae:
Schoeneae) based on plastid DNA sequences, with special reference
to the genera found in Africa. Mol. Phylogenet. Evol. 38(1):79–89.
Warnes G.R., Bolker B., Bonebakker L., Gentleman R., Huber W., Liaw
A., Lumley T., Maechler M., Magnusson A., Moeller S., Schwartz
M., Venables B. 2011. gplots package for R statistical framework.
Rochester, NY: University of Rochester.
Waterway M., Hoshino T., Masaki T. 2009. Phylogeny, species richness,
and ecological specialization in Cyperaceae tribe Cariceae. Bot. Rev.
75(1):138–159.
Weir J., Schluter D. 2007. The latitudinal gradient in recent speciation
and extinction rates of birds and mammals. Science 315(5818):
1574.
Wiens J.J. 2003a. Missing data, incomplete taxa, and phylogenetic
accuracy. Syst. Biol. 52(4):528–538.
Wiens J.J. 2003b. Incomplete taxa, incomplete characters, and
phylogenetic accuracy: is there a missing data problem? J. Vertebrate
Paleontol. 23(2):297–310.
Wiens J.J., Morrill M.C. 2011. Missing data in phylogenetic analysis:
reconciling results from simulations and empirical data. Syst. Biol.
60(5):719–731.
Wolsan M., Sato J.J. 2010. Effects of data incompleteness on the
relative performance of parsimony and Bayesian approaches in
a supermatrix phylogenetic reconstruction of Mustelidae and
Procyonidae (Carnivora). Cladistics 26(2):168–194.
Yan C., Burleigh J.G., Eulenstein O. 2005. Identifying optimal
incomplete phylogenetic data sets from sequence databases. Mol.
Phylogenet. Evol. 35(2):528–535.
Yano O., Katsuyama T., Tsubota H., Hoshino T. 2004. Molecular
phylogeny of Japanese Eleocharis (Cyperaceae) based on ITS
sequence data, and chromosomal evolution. J. Plant Res. 117:11.
Yen A., Olmstead R. 2000. Molecular systematics of Cyperaceae tribe
Cariceae based on two chloroplast DNA regions: ndhF and trnL
intron-intergenic spacer. Syst. Bot. 25(3):479–494.
Zhang X., Marchant A., Wilson K., Bruhl J.J. 2004. Phylogenetic
relationships of Carpha and its relatives (Schoeneae, Cyperaceae)
inferred from chloroplast trnL intron and trnL-trnF intergenic
spacer sequences. Mol. Phylogenet. Evol. 31(2):647–657.
Page: 219
205–219