Syst. Biol. 62(2):205–219, 2013 © The Author(s) 2012. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: [email protected] DOI:10.1093/sysbio/sys088 Advance Access publication October 26, 2012 Using Supermatrices for Phylogenetic Inquiry: An Example Using the Sedges CODY E. HINCHLIFF∗ AND ERIC H. R OALSON School of Biological Sciences, Washington State University, Pullman, WA 99164-4236, USA ∗ Correspondence to be sent to: Department of Ecology and Evolutionary Biology, University of Michigan, 2071A Kraus Natural Science Building, 830 N University, Ann Arbor, MI 48109-1048, USA; E-mail: [email protected]. Received 29 February 2012; reviews returned 14 June 2012; accepted 15 October 2012 Associate Editor: Michael A. Charleston Abstract.—In this article, we use supermatrix data-mining methods to reconstruct a large, highly inclusive phylogeny of Cyperaceae from nucleotide data available on GenBank. We explore the properties of these trees and their utility for phylogenetic inference, and show that even the highly incomplete alignments characteristic of supermatrix approaches may yield very good estimates of phylogeny. We present a novel pipeline for filtering sparse alignments to improve their phylogenetic utility by maximizing the partial decisiveness of the matrices themselves through a technique we call “phylogenetic scaffolding,” and we present a new method of scoring tip instability (i.e. “rogue taxa”) based on the I statistic implemented in the software Mesquite. The modified statistic, which we call I S , is somewhat more straightforward to interpret than similar statistics, and our implementation of it may be applied to large sets of large trees. The largest sedge trees presented here contain more than 1500 tips (about one quarter of all sedge species) and are based on multigene alignments with more than 20 000 sites and more than 90% missing data. These trees match well with previously supported phylogenetic hypotheses, but have lower overall support values and less resolution than more heavily filtered trees. Our best-resolved trees are characterized by stronger support values than any previously published sedge phylogenies, and show some relationships that are incongruous with previous studies. Overall, we show that supermatrix methods offer powerful means of pursuing phylogenetic study and these tools have high potential value for many systematic biologists. [Cyperaceae; decisiveness; megaphylogeny; PHLAWD; supermatrix.] The sedge family, Cyperaceae, is one of the world’s 10 most speciose families of flowering plants (Stevens 2008), containing more than 5400 species (Govaerts et al. 2007). Sedge species richness is highest in the tropics, but sedges may be found in nearly all angiosperm-supporting habitats across the biosphere, spanning arctic tundra, alpine zones, temperate and tropical forests, savannas, prairies, marshes, swamps, and deserts (Dahlgren et al. 1985; Bruhl 1995). Sedges are phenotypically and ecologically diverse, including tiny ephemerals, fire-resistant tussocks, scandent vines more than 10 m long, and submerged aquatic herbs. The combination of their broad distribution and diverse phenotypes make sedges a compelling system for the study of global macroevolutionary ecology in plants (Naczi and Ford 2008). Modern methods exploring macroevolutionary dynamics frequently require well-resolved, highly complete phylogenies (Smith and Donoghue 2008; Smith et al. 2009; Holton and Pisani 2010), which can be prohibitively expensive to obtain via directed sequencing of individual loci. Fortunately, public resources, such as GenBank contain an abundance of existing data that can be used for this purpose. Here, we take advantage of supermatrix data-mining methods to reconstruct the most inclusive sedge phylogeny yet published, using nucleotide data gathered from GenBank via a novel data pipeline that relies on the software PHLAWD (Smith and Donoghue 2008; Smith et al. 2009). PHLAWD implements a protocol described in its documentation as “megaphylogeny”—an iterative process involving the extraction and alignment of nucleotide sequences, which results in fully aligned matrices. PHLAWD relies on the software Muscle (Edgar 2004) and MAFFT (Katoh et al. 2005) to perform recursive profile alignments that attempt to maximize site homology even at deep evolutionary scales. The pipeline we present in this article further improves the utility of the resulting alignments through a series of complementary refinement steps that filter taxa and genes to improve taxon coverage and branch stability. Supermatrix methods offer a variety of advantages, including, perhaps most significantly, the ability to reconstruct phylogeny at broad scales with minimal investment in sequencing (McMahon and Sanderson 2006; de Queiroz and Gatesy 2007; Ren et al. 2009; Davis et al. 2010; van der Linde et al. 2010; Wolsan and Sato 2010), provided that the requisite sequence data are already available. These methods often present their own challenges, however, including issues of computational power and time efficiency, contentious questions regarding the effects of very sparse alignments on branch lengths (Lemmon et al. 2009; Wiens and Morrill 2011), and concerns about data integrity. One important challenge is presented by the detrimental effects that high levels of missing data may have on tree reconstruction. A major component of this effect has recently been examined in detail by Steel and Sanderson (2010) and Sanderson et al. (2010). These papers defined and explored a property of data matrices that they named “partial decisiveness” (PDC), which is a practical measure of the limits of a multilocus data set to inform phylogenetic inference—PDC describes the proportion of branches across all possible trees that may be resolved given the pattern of missing samples. In most real-world alignments, some branches in the treespace cannot be resolved because of insufficient taxon overlap; alignments containing such patterns thus have a PDC 205 [13:29 28/1/2013 Sysbio-sys088.tex] Page: 205 205–219 206 SYSTEMATIC BIOLOGY of < 1, and are said to be “indecisive” for those branches that cannot be reconstructed. Supermatrices are often characterized by very high levels of missing data, and as such can typically be expected to have PDC values lower than more conventional matrices. The PDC metric has not been broadly applied across a sample of such large combined matrices, but in practice we have observed PDC values between 0.6 and 0.7 to be common for unoptimized matrices compiled from public databases, which frequently contain > 70% missing data. Although it is not necessary to have a completely decisive matrix to reconstruct the true tree (because all possible branches in the treespace are not present in all trees), maximizing PDC can nonetheless generally be expected to improve tree resolution in the majority of real-world scenarios. A related challenge is presented by so-called “rogue taxa”—destabilizing tips that move around the tree enough in replicate searches to cause the collapse of clades and the erosion of branch support values. The instability of a given tip may be compounded by multiple attributes, although perhaps most frequently by (i) data nonintegrity due to sequence misidentification and (ii) very weak phylogenetic signal simply due to a dearth of meaningful phylogenetic information. Identification and removal of rogue taxa is generally expected to help improve the resolution of trees containing such unstable tips (Thorley and Wilkinson 1999; Thomson and Shaffer 2010), and several methods have been presented to accomplish this goal (Thorley and Wilkinson 1999; Thorley and Page 2000; Smith and Dunn 2008). Here, we present and make use of a statistic modified from the I score implemented in Mesquite (Maddison and Maddison 2010), which overcomes some of the drawbacks of previously available instability metrics. In this study, we present results from a supermatrix phylogeny (a PHLAWD megaphylogeny) of Cyperaceae. In this context, we address issues of data integrity as they relate to large combined analyses, and we explore the association between data decisiveness and tree resolution. We compare the results of our tree searches to results from a previously published family level analysis of the Cyperaceae, and we show that maximizing the number of loci and taxa using the methods employed here may in fact yield greater power to resolve phylogenetic trees than more traditional, focused sequencing approaches that rely on smaller numbers of better-sampled loci, even with a highly incomplete alignment. MATERIALS AND METHODS Data Collection and Validation Data were gathered from GenBank release 185 using the software tool PHLAWD (Smith et al. 2009), which searches the NCBI database for all nucleotide sequences for species within a given taxon that match a given text query. For instance, PHLAWD can be used to yield an alignment of all NCBI sequences matching the query string “internal transcribed spacer” within the [13:29 28/1/2013 Sysbio-sys088.tex] VOL. 62 NCBI taxon Cyperaceae. Because some nonorthologous sequences may be returned for any given query, PHLAWD requires a set of presupplied guide sequences for each query. Any query results that do not match these guide sequences by some arbitrary minimum amount of coverage and identity are excluded. We used cutoff proportions of 0.3 for coverage and 0.2 for identity for all searches. We gathered data on 23 genes for all available species within the Cyperaceae. Figure 1 presents a heatmap of the coverage of these data for all recognized genera. We also ranked genera by overall coverage for these markers, according to several statistics intended to measure the amount of information available for each genus (Fig. 1, bottom 4 rows; see Supplementary material for details; Data Dryad doi: 10.5061/dryad.6p76c3pb). From these data, we generated alignment files representing more than 1500 species from almost all of the family’s genera. Guide sequences for each gene were chosen arbitrarily from the set of available GenBank data, in a manner that sought to maximize their phylogenetic spread. The resulting alignment files were concatenated on the basis of species name using the software phyutility (Smith and Dunn 2008). This approach combines available data from multiple exemplar specimens of each species to make phylogeny estimation at this scale possible, and it has proven effective for estimating phylogeny of large numbers of species when parallel data from multiple markers are not available from a single specimen (Jones et al. 2002; Smith and Donoghue 2008; Smith et al. 2009). Because data integrity on public databases such as GenBank is uncertain, it is possible to include poor quality or incorrectly identified sequences when data mining these resources. When multiple sequences from different vouchers are concatenated on the basis of species identity, the potential exists for sequences from misidentified species to be concatenated with correctly identified ones, leading to the creation of so-called “chimeric” taxa. The tip nodes that are associated with these mistaken taxa are expected to contain conflicting phylogenetic signal which, at least in the case of bootstrapping, can erode support values for otherwise well-supported clades, because the chimeric tip is either expected to be placed near each of its different constituent species in different replicate searches, or to contain strong enough conflicting signal that it simply causes parts of the tree to collapse. One approach to a posteriori identification of these taxa involves scoring each taxon for a leaf stability index—a statistic that quantifies how much its position changes relative to other taxa in replicate tree searches. Those taxa with the lowest stability are assumed to contain either conflicting phylogenetic signal (especially in the case of chimeric taxa) or such low levels of signal that many placements are equivocal. Multiple methods of measuring leaf stability have been proposed (Thorley and Wilkinson 1999; Thorley and Page 2000; Smith and Dunn 2008). For this study, we built upon previous work by Maddison and Maddison (2010) in Mesquite version 2.7. They present a statistic, Page: 206 205–219 [13:29 28/1/2013 Sysbio-sys088.tex] 1500 1000 500 0 0 value Color key and histogram count 0.2 0.4 0.6 0.8 1 TOT_percent_cover TOT_proportion_spp ABS_importance ABS_import_rank trnl trnl-trnf rbcl ndhf its ets trnk rpoc1 rpob psba-trnh atpf-atph psbk-psbi adh1 adh2 atpb-rbcl accd atp1 matk trny-trne atpb trnc-ycf6 petn-psbm rps16 HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX FIGURE 1. Heat map of nucleotide sequence sampling in Cyperaceae. Sampling density and summary statistics for all currently recognized genera of Cyperaceae are shown for each of the 23 most densely sampled genetic markers on GenBank release 185. Sampling density for each marker/genus combination is a proportion of the total number of species sampled for that marker/genus divided by the total number of species recognized in that genus. Summary statistics are used to rank to relative quality of taxonomic sampling: “TOT percent cover” is the proportion of total species sequences obtained for some genus; “TOT proportion spp” is the relative contribution of a genus to the species richness of the family; “ABS importance” is a summary statistic that incorporates both TOT percent cover and TOT proportion spp to estimate the value of additional sequencing within some genus. “RANK importance” is the percentile rank of each value of ABS importance and represents the relative necessity of additional sequencing investment for each genus. See Supplementary material for a more thorough explanation including formulas used for summary statistics. Actinoschoenus Zameioscirpus Microdracoides.squamosus Volkiella.disticha Eleocharis Androtrichum Oreobolopsis Phylloscirpus Kyllingiella Courtoisina Isolepis Trichophorum Alinula Blysmus Mesomelaena Chrysitrix Tetraria Ficinia Carex Eriophorum Calyptrocarya Trianoptiles Cyathochaeta Costularia Gahnia Schoenus Fimbristylis Becquerelia Trilepis Ascolepis Trichoschoenus.bosseri Sumatroscirpus.junghuhnii Rhynchocladium.steyermarkii Reedia.spathacea Pseudoschoenus.inanus Principina.grandis Nelmesia.melanostachya Koyamaea.neblinensis Everardia Cypringlea Cephalocarpus Afrotrilepis Bisboeckelera Lepidosperma Bulbostylis Oxycaryum Rhynchospora Scleria Lagenocarpus Machaerina Kyllinga Lipocarpha Mapania Hypolytrum Cyperus Pycreus Carpha Gymnoschoenus Bolboschoenus Schoenoplectus Schoenoplectiella Chorizandra Evandra Morelotia Scirpus Tricostularia Epischoenus Coleochloa Paramapania Fuirena Diplacrum Caustis Scirpodendron Capitularina.involucrata Exocarya.sclerioides Hellmuthia.membranacea Scirpoides Remirea.maritima Sphaerocyperus.erinaceus Oreobolus Arthrostylis.aphylla Amphiscirpus Actinoscirpus Dulichium.arundinaceum Ptilothrix.deusta Lepironia.articulata Neesenbeckia.punctoria Diplasia.karatifolia Cladium Trachystylis.stradbrokensis Capeobolus.brevicaulis Khaosokia.caricoides 2013 207 Page: 207 205–219 208 SYSTEMATIC BIOLOGY I, which summarizes taxon movement among a set of trees, using differences in patristic distance between all taxon pairs across all pairs of trees in the set. Each difference in distance for a pair of taxa i and j between each pair of trees x and y is scaled to the sum of the patristic distance between i and j across both trees x and y, and these are summed across all trees. The Mesquite implementation of this method also requires an alignment to be loaded because the instability scores for each taxon are plotted against the percentage of missing data for that taxon in the alignment, but this requirement is inconvenient when the comparison of instability to missing data is not of interest. An additional drawback is that because the magnitude of the scores depends on the number of trees and the number of taxa, direct comparison of I scores generated from different distributions of trees is not straightforward. It is also not feasible to perform these analyses in Mesquite using very large phylogenies because of memory restrictions. To overcome these challenges, we wrote a Python (http://www.python.org; last accessed November 2012) program that can take advantage of parallel multiprocessing architectures and large amounts of memory to rapidly calculate instability scores even for large sets of very large trees. This script generates scores we have termed I S , which differ from I in that they are instead scaled to the random expectation for taxon movement in the provided set of trees. Thus, for all combinations {x,y} of trees x and y in some set of trees X (e.g. a set of bootstrap replicates or a Bayesian posterior distribution), and all taxa j = i in these trees, the summary statistic IiS for a given taxon i is calculated as: 2 |Dijx −Dijy | IiS = {xy} i =j DRe ·n(n−1) where Dija is the unweighted patristic distance between taxa i and j in tree a, and DRe is the random expectation for this distance, which is calculated by taking the average per-tree value of the numerator in the above equation on a subsample of trees from X with their tips randomized, and n is the number of trees in X. This statistic has the advantage of being directly interpretable as a multiple of the random expectation for taxon movement, and as such it may be compared directly even between different distributions of trees with different sets of taxa. A score of IjS = 0.7, for instance, means that taxon j moves ∼ 70% of the random expectation for taxon movement relative to all the other taxa across a given set of trees. A taxon with a score of 1.4 would, therefore, move twice this much. Data Filtering and Heuristic Searches We used 2 strategies to subsample the data gathered by phlawd to improve the resolution of resulting trees. The first strategy consisted of removing the top 10% of taxa with the highest I S scores to improve branch [13:29 28/1/2013 Sysbio-sys088.tex] VOL. 62 stability, the second of excluding taxa lacking sequences for both of 2 loci that are broadly sampled and contain strong phylogenetic signal for deep nodes in the tree; these are ndhF and rbcL. This second approach, which we refer to as phylogenetic “scaffolding,” improves the PDC of the matrix by maximizing taxon coverage for the selected markers, and relies on the assumption that those markers contain sufficient signal and sampling to reconstruct major relationships (i.e. deep nodes) in the tree. To independently assess levels of phylogenetic signal across our data set, we estimated phylogenetic informativeness profiles (Townsend and López-Giráldez 2010; Townsend and Leuenberger 2011) for the most broadly sampled loci in this study (Fig. 2), using the PhyDesign website (López-Giráldez and Townsend 2011). Informativeness profiles indicate the predicted phylogenetic signal contained by each locus as a function of tree depth, and can be useful for assessing the utility of loci to reconstruct relationships in different parts of the tree. Although rbcL has overall lower informativeness throughout the entire tree than ndhF (Fig. 2), it is the most broadly sampled marker across the family and we justified its use as part of the scaffold because it allows the inclusion of many rare taxa that would otherwise be absent, with a minimal decrease in decisiveness. To assess the relative utility of both of our filtering approaches, we conducted 4 parallel rounds of inference using data sets filtered by either (i) neither method (i.e. raw concatenated PHLAWD alignments), (ii) filtering rogue taxa only, (iii) filtering taxa lacking sequences for elected loci only, or (iv) both methods (Table 1). We performed maximum likelihood (ML) bootstraps to estimate phylogenies from these alignments using RAxML version 7.2.6 (Stamatakis 2006). For each alignment, a 300-replicate rapid bootstrap heuristic search (RAxML’s “-f a” option) was performed. Finally, we calculated descriptive statistics about the alignments and the resulting trees to compare our filtering methods’ efficacy for improving tree resolution. For comparison, we also reconstructed the Cyperaceae family alignment used in Muasya et al. (2009), which consisted of rbcL and trnL-trnF data. We applied the same heuristic search methods to this alignment and evaluated the resulting trees under the same criteria. The sequence data were gathered directly from GenBank release 185 using the accession ids listed in Muasya et al. (2009). Alignment was performed in Muscle v3.7 (Edgar 2004), with subsequent manual adjustments by eye. The trnL-trnF region is highly polymorphic at this phylogenetic depth and some regions were excluded because of difficulty assessing homology, as in the original paper. Using these methods, we were able to collect and align sequence data for 253 of the original 262 taxa. RESULTS A heatmap of sampling density for select markers across all currently recognized Cyperaceae genera is Page: 208 205–219 2013 209 4.1 2.1 0 ndhf rbcl rps16 8.2 6.2 4.1 2.1 0 trnl-trnf -7 6.2 10 8.2 Net Phy. Inform. matk -7 its 10 atpb Net Phy. Inform. HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX 8.2 6.2 4.1 2.1 0 0.8 0.6 0.4 Tree depth 0.2 0 0.8 0.6 0.4 Tree depth 0.2 0 0.8 0.6 0.4 Tree depth 0.2 0 FIGURE 2. Phylogenetic informativeness profiles for the most broadly sampled loci in the combined PHLAWD alignment, generated using PhyDesign. Labeled panels contain the infomativeness profile for the indicated gene, while the bottom-middle panel contains all profiles and the bottom right panel contains the chronogram used by the informativeness algorithm. Shapes in the tree represent the same clades as in Figure 3. TABLE 1. Descriptive statistics for alignments and trees resulting from each of four combined approaches to data filtering, and an rbcL/trnL-trnF alignment reconstructed from Muasya et al. 2009 Method Tips Sites Missing Decisiveness Resolved > 0.95 > 0.7 > 0.6 0. trnL-F+rbcL tree 1. All tips/rogues not filtered (AT/NF) 2. All tips/rogues filtered (AT/RF) 3. Scaffold taxa only/rogues not filtered (SC/NF) 4. Scaffold taxa only/rogues filtered (SC/RF) 253 1526 1366 484 435 3006 20017 19425 16025 16016 0.46 0.89 0.88 0.79 0.79 0.95 0.74 0.76 0.86 0.86 0.49 0.57 0.64 0.71 0.76 0.31 0.31 0.33 0.41 0.42 0.77 0.72 0.74 0.75 0.76 0.91 0.86 0.87 0.87 0.86 Notes: Row 0 corresponds to this rbcL/trnL-trnF alignment. Rows 1–4 correspond to various filtering methods applied to the PHLAWD alignment: 1. No filtering, all available data for the markers chosen at the data-mining step are used. 2. Filtering of unstable tips only—all sequence data from the 90% most stable tips are used. 3. Filtering of less informative loci only—taxa not represented by at least an rbcL or ndhF sequence are removed, all other available markers for the remaining taxa retained. 4. Filtering of loci and taxa—all available loci for the 90% most stable taxa represented by at least ndhF or rbcL are retained. All decimal values are proportions. The “Tips” and “Sites” columns show the actual count of tips and sites in the alignments; “Missing” is the proportion of missing data across all cells; “Decisiveness” is the PDC metric; “Resolved” shows the proportion of resolved nodes in the majority rule consensus trees calculated from the ML bootstrap tree sets, and the last 3 columns represent the proportion of these resolved nodes bearing a bootstrap support value greater than or equal to the respective cutoff value. presented in Figure 1. Only a small handful of loci (those at the top of the matrix) show reasonably broad sampling across the family. The broadest sampled are ndhF and rbcL. The second to last row, ABS_importance, contains values of a summary statistic we developed to quantify taxon coverage for each genus, for the markers used in this study (see Supplementary material for formulas). Because all genera have been poorly sampled for most of these markers, all their importance scores are quite similar. We therefore calculated the percentile rank of each importance score, which are presented in the last row, ABS_import_rank. High scores in this row indicate those genera with the lowest levels of sampling. This [13:29 28/1/2013 Sysbio-sys088.tex] heatmap was generated using the gplots package in R (Warnes et al. 2011). Table 1 presents statistics describing the alignments and corresponding ML trees constructed from the raw and filtered phlawd alignments and also the rbcL + trnL-trnF alignment reconstructed from Muasya et al. (2009). Despite having very high decisiveness, the rbcL +trnL-trnF alignment performed similarly to the concatenated, unfiltered PHLAWD alignment, and was actually surpassed by the raw PHLAWD data in terms of total proportion of nodes resolved, by ∼ 10%. For the filtered PHLAWD alignments, each increase in filtering stringency was associated with further improvement to Page: 209 205–219 210 SYSTEMATIC BIOLOGY the resolution of the resulting consensus trees (Fig. 3), as evidenced by the increase in the proportion of resolved nodes (Table 1). Although the frequency of nodes with > 0.6 bootstrap proportion (BP) varied little as filter strictness increased, the proportion of nodes with 0.7 or greater BP showed a steady increase. The most dramatic improvement was seen when taxa not represented by either an rbcL or ndhF sequence were excluded from the alignment—the proportion of nodes resolved with > 0.95 BP increased by 8%, with a concomitant and presumably related 10% increase in the decisiveness of the matrix. The best-resolved and best-supported tree (Figs. 4–7) was generated from the most strictly filtered alignment (Table 1, row 4), which contained only the 90% most stable tips from the alignment consisting only of those taxa represented by a sequence for at least one of the loci ndhF and rbcL. The majority-rule consensus topologies from the ML bootstrap searches corresponding to rows 1–4 of Table 1 are presented in Figure 3. The trees are presented without taxon names or support values because space constraints preclude legible presentation of these data in printed form (higher-resolution versions of trees 1–3 with legible taxon names and support values are available in Supplementary material, respectively; Data Dryad doi: 10.5061/dryad.6p76c3pb). Figure 3 clearly shows an overall increasing trend in the proportion of resolved nodes that accompanied more stringent filtering methods. The best-resolved consensus tree, corresponding to the most heavily filtered alignment (Table 1, row 4), is presented in Figures 4–7. Support values throughout most of the tree are high, especially for deep nodes, and relationships among major lineages of sedges are generally well-resolved, with strong support for the monophyly of the 2 subfamilies Mapanioideae and Cyperoideae, and generally strong support for most previously recognized major clades. The monophyly of many currently recognized genera, however, is not supported, in agreement with previous findings from the studies that generated the data we gathered from GenBank. DISCUSSION Data Decisiveness and Tree Reconstruction PDC (Sanderson et al. 2010; Steel and Sanderson 2010) provides a general method of assessing the effect of missing data on tree reconstruction methods, and it is of interest for supermatrix data mining because maximizing PDC is likely to improve phylogenetic resolution. This metric is independent from measures of phylogenetic signal; instead it describes the proportion of branches, out of the set of all branches in all possible trees, which may be reconstructed given the patterns of taxon overlap among markers in the data set. Sampling matrices with high decisiveness may be used to reconstruct most or all branches in many trees, whereas those with low decisiveness may lack the taxon overlap [13:29 28/1/2013 Sysbio-sys088.tex] VOL. 62 to even allow resolution of many or all branches, even if the sampled markers have strong phylogenetic signal. Because missing data can have such strong effects on tree searching, decisiveness should be of interest to all systematists working with even moderately incomplete data sets, and particularly so for those using very large, sparse data sets. However, achieving complete decisiveness is not necessarily a requirement for resolving the best tree (or set of trees), because under the common assumption that there are relatively few best trees (or just one), only a tiny fraction of all possible branches is represented in them (at least for large trees). As long as patterns of missing data do not preclude the resolution of these “good” branches, very good trees may be found using matrices with PDC considerably < 1, as we have demonstrated here. The PDC metric itself actually measures the proportion of 4-way combinations (quartets) of taxa that are represented by homologous data for at least a single locus in the alignment. Matrices within which every quartet is represented by a set of homologous sequences for at least one locus are fully decisive. Steel and Sanderson (2010) called this the 4-way partition property. The phylogenetic scaffolding approach we used relies on this property: it maximizes the proportion of homologous quartets/triplets by excluding taxa not represented for some subset of loci—the scaffold backbone. When a single locus is used as a backbone, scaffolding results in a completely decisive matrix, because the 4-way partition property is met through the presence of all taxa at that locus. We used 2 wellsampled—but not completely sampled—loci to form a backbone, which does not necessarily result in a completely decisive matrix, but has the advantage of allowing more tips to be included. Phylogenetic Informativeness and Scaffolding An additional advantage of phylogenetic scaffolding is that it provides a means to robustly reconstruct deep nodes in the tree (using the backbone loci), while allowing the inclusion of faster evolving markers to inform shallower relationships, even when sampling does not overlap among clades. Identifying good backbone markers requires consideration not only of how broadly sampled the candidate markers are but also the depths at which they will be maximally informative. One metric that may be helpful for selecting scaffold backbones is the phylogenetic informativeness of Townsend and Leuenberger (2011) and Townsend and López-Giráldez (2010), which quantifies the predicted phylogenetic signal for a given marker as a function of tree depth. As shown in Figure 2, some loci that are broadly sampled, such as ITS and trnL-trnF, are characterized by having informativeness peaks at relatively shallow depths. The peak informativness is presumably the point at which the sequence data reach saturation in terms of number of changes, and beyond which their signal Page: 210 205–219 2013 HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX a) b) c) d) 211 FIGURE 3. Majority-rule consensus trees resulting from parallel 300-replicate ML bootstrap searches performed on each of the alignments summarized in Table 1. a)–d) correspond to rows 1–4 from Table 1, respectively. Although tip and branch labels could not be included due to space constraints, several topological landmark nodes corresponding to several major clades are labeled. The star labels the genus Carex, the square Eleocharis, the circle the tribe Cypereae, and the triangle the tribe Schoeneae. The dotted lines in c) and d) indicate an area of the tree that apparently experienced a decrease in resolution as a result of more stringent rogue taxon filtering, although as discussed in the text, this apparent loss of resolution may actually represent an increase in phylogenetic accuracy. is replaced by noise. Markers showing this pattern are presumed to be more informative for shallow nodes than deep ones. Loci exhibiting this pattern are of suspect utility for use as a scaffold backbone, but may be included in alignments containing strong signal in other loci for deep nodes. In the context of this study, loci such as rbcL and ndhF have lower predicted overall signal than some more rapidly evolving markers, but because they remain unsaturated deep in the tree, they are presumed to contain more information to resolve deep nodes (Fig. 2). Unsaturated loci such as these are better candidates for use as backbones. We have observed, however, that high predicted signal does not necessarily equate with high resolution even in clades [13:29 28/1/2013 Sysbio-sys088.tex] for which a locus is well-sampled (e.g. trnL-trnF for the genus Carex). The relatively low performance of the rbcL+trnL-trnF alignment (Table 1, row 1), despite the exceptional predicted signal for trnL-trnF, provides an example of this (Fig. 2). Rogue Taxon Filtering Filtering out the 10% least stable taxa from the mostinclusive alignment (Table 1, row 2) resulted in a 7% increase in the proportion of resolved nodes, whereas filtering the 10% least stable taxa from the scaffolded alignment resulted in a 5% increase. These differences Page: 211 205–219 212 SYSTEMATIC BIOLOGY VOL. 62 FIGURE 4. Majority rule consensus tree resulting from a 300-replicate ML search performed on the alignment corresponding to Table 1, row 4. Branch labels indicate BP. Major clades and grades of the Cyperaceae are labeled to the right. D. = Dulichieae; Fuir. grade = Fuireneae grade; T. = Trilepideae. Each labeled group occurs only once in the tree, but it may extend across multiple pages. In these cases, the parts of the group on different pages are labeled individually. [13:29 28/1/2013 Sysbio-sys088.tex] Page: 212 205–219 2013 213 HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX To Fig. 6 71.67 76.67 69.67 95 60.67 77.67 50 99.67 99.67 98 56.33 99.33 98.67 93.67 95.67 82 97 100 99 91.67 98.67 98.67 80 100 70 100 91.33 71.33 93.33 94 100 84 75.67 61.67 64 98 75.67 63 62 100 88.67 78.33 59.33 76.67 83.67 96.67 72 99.67 87 52.33 67.67 97 51.33 83.33 53.67 FIGURE 5. 98.33 59 D. To Fig. 4 58.33 Scirpeae 95 94.33 Carex_pairae Carex_divulsa Carex_otrubae Carex_lamprocarpa Carex_echinata Carex_lachenalii Carex_canescens Carex_marina Carex_ovalis Carex_xerantica Carex_chordorrhiza Carex_cephalophora Carex_conferta Carex_maritima Uncinia_hamata Uncinia_phleoides Uncinia_nemoralis Uncinia_uncinata Uncinia_filiformis Carex_obtusata Carex_rupestris Carex_monostachya Carex_elynoides Carex_nardina Carex_capillacea Carex_leptalea Kobresia_prattii Kobresia_capillifolia Carex_backii Cymophyllus_fraserianus Kobresia_simpliciuscula Carex_pauciflora Kobresia_fragilis Kobresia_laxa Schoenoxiphium_filiforme Schoenoxiphium_burkei Schoenoxiphium_sparteum Schoenoxiphium_ecklonii Schoenoxiphium_ludwigii Carex_camptoglochin Carex_microglochin Carex_peregrina Carex_pulicaris Eriophorum_brachyantherum Eriophorum_chamissonis Eriophorum_angustifolium Eriophorum_vaginatum Scirpus_flaccidifolius Scirpus_hattorianus Scirpus_georgianus Scirpus_ancistrochaetus Scirpus_polystachyus Scirpus_karuizawensis Scirpus_mitsukurianus Scirpus_wichurae Scirpus_fontinalis Scirpus_pendulus Scirpus_atrocinctus Scirpus_cyperinus Scirpus_microcarpus Scirpus_sylvaticus Scirpus_expansus Scirpus_radicans Scirpus_orientalis Phylloscirpus_acaulis Phylloscirpus_deserticola Phylloscirpus_bolivianus Amphiscirpus_nevadensis Trichophorum_alpinum Trichophorum_pumilum Trichophorum_planifolium Trichophorum_clintonii Oreobolopsis_clementis Trichophorum_dioicum Oreobolopsis_tepalifera Oreobolopsis_inversa Trichophorum_rigidum Trichophorum_subcapitatum Zameioscirpus_muticus Zameioscirpus_atacamensis Trichophorum_cespitosum Eriophorum_crinigerum Khaosokia_caricoides Dulichium_arundinaceum Blysmus_compressus Carex 73.33 See legend to Figure 4. may seem small, but visual inspection of the resulting consensus trees (Fig. 3, compare 3a vs. 3b and 3c vs. 3d) confirms that they represent a meaningful increase in resolution. The newly resolved nodes are concentrated in areas of the tree that were mostly to entirely unresolved by the unfiltered alignments. Some areas of the tree [13:29 28/1/2013 Sysbio-sys088.tex] actually experienced an apparent decrease in resolution as a result of stringent rogue filtering (see dotted outlines in Fig. 3d), but the nodes that were lost were characterized by relatively low support values, and with their removal, support values for surrounding nodes increased. Page: 213 205–219 214 FIGURE 6. SYSTEMATIC BIOLOGY VOL. 62 See legend to Figure 4. [13:29 28/1/2013 Sysbio-sys088.tex] Page: 214 205–219 2013 FIGURE 7. HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX 215 See legend to Figure 4. [13:29 28/1/2013 Sysbio-sys088.tex] Page: 215 205–219 216 SYSTEMATIC BIOLOGY Several leaf instability indices exist, all of which can be used to filter rogues (Thorley and Wilkinson 1999; Thorley and Page 2000; Smith and Dunn 2008; Maddison and Maddison 2010). They serve similar purposes and should have essentially similar results, but their interpretations may differ. The statistic developed by Thorley and Wilkinson (1999), implemented in RadCon (Thorley and Page 2000) and phyutility (Smith and Dunn 2008), is based on branch support values and may be sensitive to them, while the I statistic employed by Maddison and Maddison (2010) in Mesquite measures taxon movement directly (as opposed to using support values), but is not standardized among data sets. The statistic I S that we employ also measures movement directly, but creates scores that may be directly compared among data sets. The python script implementing the I S calculations may also be applied to larger sets of larger trees than some of the other methods, using parallel computation. How the I S method compares to the graph-based approach RogueNaRok (Aberer et al. 2012) remains to be seen. Data Filtering and Sampling Strategies Although filtering taxa out of alignments reduces their taxonomic inclusivity, it is a tradeoff that may be useful for developing highly resolved trees (Kearney 2002; Wiens 2003b; Campbell and Lapointe 2009; Thomson and Shaffer 2010; Aberer et al. 2012). In the case of supermatrices, even heavily filtered data sets are often considerably more taxon inclusive than manually generated ones. In this study, our best-resolved tree has 435 tips—only 29% of the available Cyperaceae operational taxonomic units (OTU) on GenBank— but this still represents an addition of 173 OTUs (a 66% increase) when compared with the next largest published phylogeny for the group (Muasya et al. 2009). In addition, our trees present stronger support for deep relationships than any previous family level tree (Katsuyama et al. 2007; Muasya et al. 2009; Hinchliff et al. 2010), which is likely directly due to the greater number and variety of loci used to generate them. The sampling patterns shown in Figure 1 reflect a composite sampling strategy that has been applied to many organismal groups, because most phylogenetic studies have traditionally fallen into one of 2 sampling categories (i) many studies focus narrowly on a single (often large, species rich) clade, which is sampled extensively for a small number of rapidly evolving markers that may not be shared by other studies, while (ii) some studies take a broader sample across a deeper taxonomic scale of one or more relatively slowly evolving markers that are often not shared with the finerscale studies. High-throughput sequencing is likely to influence these strategies, but there will still exist slowly evolving markers that are broadly represented, and fast markers specific to certain groups. [13:29 28/1/2013 Sysbio-sys088.tex] VOL. 62 In the case of Cyperaceae, ndhF and rbcL have been sampled broadly by family level studies, whereas ITS, ETS and a variety of chloroplast markers have been used by infrageneric studies. Because of low levels of overlap among studies, Figure 1 contains a high proportion of blue cells representing absent data for most locus-taxon combinations. This apparent lack of sampling, however, is not necessarily problematic for phylogeny reconstruction because the nature of the phylogenetic sampling is geared specifically to address the relationships in the most reasonable trees, though careful selection of markers and taxa by the authors of phylogenetic studies. While careful addition of sequence data to sparse alignments can dramatically improve results (van der Linde et al. 2010; Brown J.W., unpublished data), it is typically redundant to require complete coverage for all markers, and prohibitively so for supermatrices. The power of these methods is their ability to tie together disparate but already information-rich data sets, and the generation of full alignments is not required for success (Wiens 2003a). This is simply because, given the sampling patterns inherent in phylogenetic studies, only a small subset of additional sequence data are likely to contain additional phylogenetic information—it is not advisable to exhaustively sample slowly evolving markers for closely related taxa, because they will contain little variation, nor does it makes sense to sequence rapidly evolving markers for distantly related taxa, because they will contain a great deal of noise. It is worth pointing out, however, that despite our abilities to accurately reconstruct relationships despite high levels of missing data, sparse alignments can (and do) negatively impact our ability to infer branch lengths (Lemmon et al. 2009). Some disagreement exists regarding the severity of this problem (see Wiens and Morrill 2011), and the topic of sampling optimization remains of interest (Yan et al. 2005; Mittelbach et al. 2007; Weir and Schluter 2007; Svenning et al. 2008; Sanderson et al. 2010; Thomson and Shaffer 2010; Townsend and López-Giráldez 2010; Cho et al. 2011). Implications for Large Combined Analyses Results from our large-tree analyses are promising. The best resolved trees from this study improve upon less inclusive family level trees (Katsuyama et al. 2007; Muasya et al. 2009; Hinchliff et al. 2010), suggesting an important role for effective methods to deal with large, sparse, and potentially noisy data sets As advances in sequencing technology drive the generation of more large, noisy data sets, these methods will become increasingly more important. Harnessing the power of these data sets requires an initial planning investment to ensure data integrity and to optimize phylogenetic resolution, but the result of successful planning and data management is very large, very well-resolved phylogenies, that in many cases may surpass previous expectations about the limits of possibility. Page: 216 205–219 2013 HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX Although supermatrix methods may offer an alternative to the generation of novel sequence data, it is important to keep in mind that these approaches are complementary. Data mining methods rely on the adequacy of previous sampling, and many taxa and genomes remain absent from public databases (Fig. 1). Wet laboratory techniques fulfill the vital role of improving the coverage of these collaborative data sets, and the generation of novel sequence data is likely to remain essential to progress in systematic biology for the foreseeable future. Systematics and Classification of Cyperaceae Our results corroborate many previous findings regarding sedge phylogeny, but provide new results that contradict some earlier studies. One of the largest differences between this and the next most-inclusive sedge phylogeny in the literature (Muasya et al. 2009) is that the tree we present has very strong support for nearly all relationships among major clades. Several differences in topology also exist. Classification units referred to herein follow Simpson et al. (2007) and Goetghebeur (1998). For a thorough, concise discussion of classification, refer to Muasya et al. (2009). In our best-resolved sedge phylogeny (Figs. 4–7), subfamily Mapanioideae is strongly supported as sister to the rest of the Cyperaceae, with the Trilepideae sister to the all remaining, as in Muasya et al. (2009) (Fig. 4). However, we find a Sclerieae+Bisboeckelereae clade strongly supported as sister to all remaining Cyperaceae, which is incongruous with Muasya et al. (2009) where these tribes were nested within the Schoeneae. A monophyletic Sclerieae+Bisboeckelereae that is separate from Schoeneae better fits expectations based on morphological classifications (Goetghebeur 1998) and some previous molecular phylogenetic analyses (Simpson et al. 2007). Like Muasya et al. (2009), however, we resolve Didymiandrum and Lagenocarpus, members of tribe Cryptangieae that have previously been classified with Scleria (Goetghebeur 1998), within the Schoeneae with very high support (Fig. 4). We find support for this Schoeneae+Cryptangieae clade as sister to all other remaining lineages. Although relationships within the Schoeneae are generally unresolved, several wellsupported clades corresponding to major genera are present. Some genera in Schoeneae appear polyphyletic, such as Tetraria (corroborated by Verboom 2006) and Gahnia (though this finding is incongruent with Zhang et al. 2004). There is good support for a sister relationship between Cladium and the rest of the tribe (Fig. 4). The genus Rhynchospora is strongly supported as sister to all remaining lineages and contains all present members of the genus Pleurostachys (Fig. 4). Strong support exists for one large additional major clade consisting of 2 sister subclades: (i) Cariceae+Dulichieae+Khaosokia+Scirpeae; and (ii) Abildgaardieae+Cypereae+Eleocharis+Fuireneae (Figs. 5–7). Within the first of these, Dulichieae [13:29 28/1/2013 Sysbio-sys088.tex] 217 is supported in a sister relationship with a Cariceae+Khaosokia+Scirpeae clade, but relationships between Cariceae, Scirpeae, and the monotypic Khaosokia caricoides are unresolved. The monophyly of the genus Carex, if circumscribed to include Cymophyllus, Kobresia, Schoenoxiphium, and Uncinia (i.e. the Cariceae), is strongly supported (Figs. 5 and 6). Most species of Scirpeae present in this data set are not represented by sequences at rapidly evolving, data rich markers such as ndhF and ITS, and the addition of these markers may help clarify the relationships within this group and among Cariceae, Khaosokia, and Scirpeae. The monophyly of a clade containing the Abildgaardieae and Eleocharis is well supported, as is the monophyly of both Eleocharis and the Abildgaardieae themselves (Fig. 6). Relationships among the genera Bolboschoenus, Fuirena, Schoenoplectiella, and Schoenoplectus are poorly resolved (Figs. 6 and 7), and additional studies involving novel sequencing effort will likely be necessary to shed light on this problematic area. Within the Cypereae, a strongly supported sister relationship is indicated between (i) a Ficinia + Isolepis clade in which these 2 genera form nearly monophyletic groups, and (ii) a clade in which Scirpoides is sister to a clade containing a highly paraphyletic Cyperus with numerous other genera interdigitated throughout it (Fig. 7). Relationships within major clades are in general less well-resolved than those in the studies from which these data were originally published (Roalson and Friar 2000; Yen and Olmstead 2000; Muasya et al. 2001; Roalson et al. 2001; Muasya et al. 2002; Yano et al. 2004; Zhang et al. 2004; Chacón et al. 2006; Verboom 2006; Ghamkhar et al. 2007; Katsuyama et al. 2007; Simpson et al. 2007; Hinchliff and Roalson 2009; Muasya et al. 2009; Starr and Ford 2009; Thomas et al. 2009; Waterway et al. 2009; Hinchliff et al. 2010; Roalson et al. 2010). This may be due to decreased levels of decisiveness for shallow branches because of the addition of samples (such as outgroup taxa) that bridge taxonomic gaps among studies but lack overlap for sampled genes. Low resolution near the tips may also result from lower amounts of phylogenetic information available to inform shallow splits, because of the smaller numbers of taxa (and therefore fewer sampled loci) that can be used. We find strong support for recognizing the following clades (disregarding the taxonomic rank at which they are recognized): Mapanioideae, Trilepideae, Sclerieae (including Bisboeckelereae), Schoeneae (including Cryptangieae), Rhynchosporeae, Abildgaardieae (including Arthrostylideae), and Eleocharideae (Figs. 4–7). The Scirpeae+Cariceae clade (Figs. 5 and 6), and the Fuireneae+Cypereae clade (Figs. 6 and 7), each contain poorly resolved regions, which may contain clades that conflict with current classification units. The key question regarding the Scirpeae+Cariceae clade is whether the Scirpeae is monophyletic or forms a grade leading to Cariceae; both of these patterns have been resolved in published studies (grade: Hinchliff et al. 2010; clade, including Dulichieae: Muasya et al. 2009), although never with strong support. The phylogeny we Page: 217 205–219 218 present suggests the presence of 4 lineages, if Eriophorum crinigerum groups with the rest of Scirpeae: the Dulichieae (Dulichium + Blysmus), Khaosokia, Scirpeae, and Cariceae. More targeted work on the Scirpeae will be necessary to clarify this. The Fuireneae+Cypereae clade presents a similar problem: the monophyletic Cypereae contains 2 well-supported clades (Cyperus s.l. and Ficinia/Isolepis), but the taxa usually attributed to the Fuireneae form a polytomy below Cypereae (Figs. 6 and 7). Previous studies have seen these lineages positioned in many different locations, usually without strong support (Simpson et al. 2007; Muasya et al. 2009), but a study using ndhF and psbB-psbH (Hinchliff et al. 2010) showed strong support for a Fuireneae grade leading to the Cypereae. Additional sampling of ndhF and other data-rich cpDNA regions such as psbB-psbH and perhaps matK may help clarify these relationships. Overall, 9 clades are strongly supported and morphologically diagnosible (Mapanioideae, Trilepideae, Sclerieae, Schoeneae, Rhynchosporeae, Abildgaardieae, Eleocharis, and Cypereae), and should be recognized in a new classification, as previous classifications are clearly do not define phylogenetic lineages as we now know them. Additional research will clarify how many diagnosible lineages will need to be recognized within the Carex + Dulichieae + Khaosokia + Scirpeae clade and the Fuireneae assemblage. CONCLUDING REMARKS We have shown that highly incomplete alignments can produce well-resolved, highly taxon-inclusive trees, and that supermatrix data-mining methods that yield these are feasible to employ. The utility of highly incomplete alignments may be negatively impacted by missing data, but the PDC metric of Sanderson et al. (2010) provides a powerful way of assessing and extending these limits to inference. One such improvement method is matrix scaffolding: the removal of tips lacking data for one or more backbone loci known to contain strong signal for important nodes deep in the tree. We have shown that this method can dramatically improve both the PDC of a matrix and the resolution of the resulting trees. We also present a novel statistic for identifying tips with weak or conflicting signal, I S , based on previous work by Maddison and Maddison (2010), but which can be more easily interpreted across varied data sets. The trees we present represent the largest, most well-resolved sedge phylogenies published to date, and have direct implications for sedge systematics and classification. This study is exemplary of a recent but rapidly developing trend in systematics, where large, aggregate data sets gathered by automated tools can be used to yield answers to longstanding questions in biology at scales once thought impossible. As these methods continue to mature, we can expect to see greater data integration, more powerful tree-searching and statistical tools, and many new opportunities for the study of organic evolution on Earth. [13:29 28/1/2013 Sysbio-sys088.tex] VOL. 62 SYSTEMATIC BIOLOGY SUPPLEMENTARY MATERIAL Data files and/or other supplementary information related to this paper have been deposited on Dryad at http://datadryad.org under doi: 10.5061/dryad. 6p76c3pb. FUNDING This work was supported by the National Science Foundation [DEB 1011206 to C.E.H.]. ACKNOWLEDGEMENTS Useful discussions from WSU/UI PuRGe were invaluable, as was specific feedback from Matt Pennell, Luke Harmon, and Jeremiah Busch. Many thanks to Steve Orzell and Edwin Bridges of Avon Park, FL, for companionship and hospitality in the field that mistakenly went unobserved in an earlier paper. REFERENCES Aberer A.J., Krompass D., Stamatakis A. 2012. Pruning rogue taxa improves phylogenetic accuracy: an efficient algorithm and webservice. Systematic Biology (in press) doi:10.1093/sysbio/ sys078. Bruhl J.J. 1995. Sedge genera of the world: relationships and a new classification of the Cyperaceae. Aust. Syst. Bot. 8:125–305. Campbell V., Lapointe F.J. 2009. The use and validity of composite taxa in phylogenetic analysis. Syst. Biol. 58(6):560–572. Chacón J., Madriñán S., Chase M.W., Bruhl J.J. 2006. Molecular phylogenetics of Oreobolus (Cyperaceae) and the origin and diversification of the American species. Taxon 55(2):359–366. Cho S., Zwick A., Regier J.C., Mitter C., Cummings M.P., Yao J., Du Z., Zhao H., Kawahara A.Y., Weller S., Davis D.R., Baixeras J., Brown J.W., Parr C. 2011. Can deliberately incomplete gene sample augmentation improve a phylogeny estimate for the advanced moths and butterflies (Hexapoda: Lepidoptera)? Syst. Biol. 60(6):782–796. de Queiroz A., Gatesy J. 2007. The supermatrix approach to systematics. Trends Ecol. Evol. 22(1):34–41. Dahlgren R.M.T., Clifford H.T., Yeo P.F. 1985. The families of monocotyledons: structure, evolution, and taxonomy. Berlin: Springer-Verlag. Davis B.W., Li G., Murphy W.J. 2010. Supermatrix and species tree methods resolve phylogenetic relationships within the big cats, Panthera (Carnivora: Felidae). Mol. Phylogenet. Evol. 56(1):64–76. Edgar R.C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797. Ghamkhar K., Marchant A.D., Wilson K.L., Bruhl J.J. 2007. Phylogeny of Abildgaardieae (Cyperaceae) inferred from ITS and trnL-F data. Aliso 23:149–164. Goetghebeur P. 1998. Cyperaceae. In: Kubitzki K., editor. The families and genera of vascular plants. Berlin: Springer. p. 164. Govaerts R., Simpson D.A., Bruhl J.J., Egorova T.V., Goetghebeur P., Wilson K.L. 2007. World checklist of Cyperaceae. London: Royal Botanic Gardens Kew. Hinchliff C.E., Lliully Aguilar A., Carey T., Roalson E.H. 2010. The origins of Eleocharis (Cyperaceae) and the status of Websteria, Egleria, and Chillania. Taxon 59(3):709–719. Hinchliff C.E., Roalson E.H. 2009. Stem architecture in Eleocharis subgenus Limnochloa (Cyperaceae): evidence of dynamic morphological evolution in a group of pantropical sedges. Am. J. Bot. 96(8):1487–1499. Holton T.A., Pisani D. 2010. Deep genomic-scale analyses of the metazoa reject Coelomata: evidence from single- and multigene Page: 218 205–219 2013 HINCHLIFF AND ROALSON—CYPERACEAE SUPERMATRIX families analyzed under a supertree and supermatrix paradigm. Genome Biol. Evol. 2:310–324. Jones K., Purvis A., MacLarnon A., Bininda-Emonds O., Simmons N. 2002. A phylogenetic supertree of the bats (Mammalia: Chiroptera). Biol. Rev. 77:223–259. Katoh K., Kuma K., Toh H., Miyata T. 2005. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33:511–518. Katsuyama T., Hirahara T., Hoshino T. 2007. Suprageneric phylogeny of Japanese Cyperaceae based on DNA sequences from chloroplast ndhF and 5.8 S nuclear ribosomal DNA. Acta Phytotaxon. Geobot. 58(2):57. Kearney M. 2002. Fragmentary taxa, missing data, and ambiguity: mistaken assumptions and conclusions. Syst. Biol. 51(2):369–381. Lemmon A.R., Brown J.M., Stanger-Hall K., Lemmon E.M. 2009. The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Syst. Biol. 58(1): 130–145. López-Giráldez F., Townsend J.P. 2011. PhyDesign: an online application for profiling phylogenetic informativeness. BMC Evol. Biol. 11(1):152. Maddison W.P., Maddison D.R. 2010. Mesquite: a modular system for evolutionary analysis. Vancouver, BC: University of British Columbia. McMahon M., Sanderson M.J. 2006. Phylogenetic supermatrix analysis of GenBank sequences from 2228 papilionoid legumes. Syst. Biol. 55(5):818. Mittelbach G., Schemske D.W., Cornell H.V., Allen A.P., Brown J.M., Bush M.B., Harrison S.P., Hurlbert A., Knowlton N., Lessios H. 2007. Evolution and the latitudinal diversity gradient: speciation, extinction and biogeography. Ecol. Lett. 10(4):315–331. Muasya A.M., Simpson D.A., Chase M.W. 2002. Phylogenetic relationships in Cyperus L. sl (Cyperaceae) inferred from plastid DNA sequence data. Bot. J. Linn. Soc. 138(2):145–153. Muasya A.M., Simpson D.A., Chase M.W., Culham A. 2001. A phylogeny of Isolepis (Cyperaceae) inferred using plastid rbcL and trnL-F sequence data. Syst. Bot. 26(2):342–353. Muasya A.M., Simpson D.A., Verboom G.A., Goetghebeur P., Naczi R., Chase M.W., Smets E. 2009. Phylogeny of cyperaceae based on DNA sequence data: current progress and future prospects. Bot. Rev. 75(1):2–21. Naczi R.F., Ford B.A. 2008. Sedges: uses, diversity, and systematics of the Cyperaceae. Saint Louis (MO): Missouri Botanic Garden Press. Ren F., Tanaka H., Yang Z. 2009. A likelihood look at the supermatrixsupertree controversy. Gene 441:119–125. Roalson E.H., Columbus J.T., Friar E.A. 2001. Phylogenetic relationships in Cariceae (Cyperaceae) based on ITS (nrDNA) and trnT-LF (cpDNA) region sequences: assessment of subgeneric and sectional relationships in Carex with emphasis on section Acrocystis. Syst. Bot. 26:318–341. Roalson E.H., Friar E.A. 2000. Infrageneric classification of Eleocharis (Cyperaceae) revisited: evidence from the internal transcribed spacer (ITS) region of nuclear ribosomal DNA. Syst. Bot. 25(2): 323–336. Roalson E.H., Hinchliff C.E., Trevisan R., Da Silva C. 2010. Phylogenetic relationships in Eleocharis (Cyperaceae): C4 photosynthesis origins and patterns of diversification in the Spikerushes. Syst. Bot. 35(2):257–271. Sanderson M.J., McMahon M., Steel M. 2010. Phylogenomics with incomplete taxon coverage: the limits to inference. BMC Evol. Biol. 10(1):155. Simpson D.A., Muasya A.M., Alves M., Bruhl J.J., Dhooge S., Chase M.W., Furness C.A., Ghamkhar K., Goetghebeur P., Hodkinson T., Marchant A.D., Reznicek A.A., Nieuwborg R., Roalson E.H., Smets E., Starr J.R., Thomas W.W., Wilson K.L., Zhang X. 2007. Phylogeny of Cyperaceae based on DNA sequence data—a new rbcL analysis. Aliso 23:72–83. Smith S.A., Beaulieu J.M., Donoghue M.J. 2009. Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches. BMC Evol. Biol. 9:37. Smith S.A., Donoghue M.J. 2008. Rates of molecular evolution are linked to life history in flowering plants. Science 322(5898): 86–89. [13:29 28/1/2013 Sysbio-sys088.tex] 219 Smith S.A., Dunn C.W. 2008. Phyutility: a phyloinformatics tool for trees, alignments and molecular data. Bioinformatics 24(5):715–716. Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688–2690. Starr J.R., Ford B.A. 2009. Phylogeny and evolution in Cariceae (Cyperaceae): current knowledge and future directions. Bot. Rev. 75(1):110–137. Steel M., Sanderson M.J. 2010. Characterizing phylogenetically decisive taxon coverage. Appl. Math. Lett. 23(1):82–86. Stevens P.F. 2008. Angiosperm Phylogeny Website. Available from: URL http://www.mobot.org/mobot/research/APweb/. Last accessed November 2012. Svenning J., Borchsenius F., Bjorholm S., Balslev H. 2008. High tropical net diversification drives the New World latitudinal gradient in palm (Arecaceae) species richness. J. Biogeogr. 35(3):394–406. Thomas W.W., Araújo A.C., Alves M. 2009. A preliminary molecular phylogeny of the Rhynchosporeae (Cyperaceae). Bot. Rev. 75(1): 22–29. Thomson R.C., Shaffer H.B. 2010. Sparse supermatrices for phylogenetic inference: taxonomy, alignment, rogue taxa, and the phylogeny of living turtles. Syst. Biol. 59(1):42–58. Thorley J.L., Page R.D.M. 2000. RadCon: phylogenetic tree comparison and consensus. Bioinformatics 16:486–487. Thorley J.L., Wilkinson M. 1999. Testing the phylogenetic stability of early tetrapods. J. Theor. Biol. 200(3):343. Townsend J.P., Leuenberger C. 2011. Taxon sampling and the optimal rates of evolution for phylogenetic inference. Syst. Biol. 60(3): 358–365. Townsend J.P., López-Giráldez F. 2010. Optimal selection of gene and ingroup taxon sampling for resolving phylogenetic relationships. Syst. Biol. 59(4):446–457. van der Linde K., Houle D., Spicer G., Steppan S. 2010. A supermatrixbased molecular phylogeny of the family Drosophilidae. Genet. Res. 92(1):25–38. Verboom G.A. 2006. A phylogeny of the schoenoid sedges (Cyperaceae: Schoeneae) based on plastid DNA sequences, with special reference to the genera found in Africa. Mol. Phylogenet. Evol. 38(1):79–89. Warnes G.R., Bolker B., Bonebakker L., Gentleman R., Huber W., Liaw A., Lumley T., Maechler M., Magnusson A., Moeller S., Schwartz M., Venables B. 2011. gplots package for R statistical framework. Rochester, NY: University of Rochester. Waterway M., Hoshino T., Masaki T. 2009. Phylogeny, species richness, and ecological specialization in Cyperaceae tribe Cariceae. Bot. Rev. 75(1):138–159. Weir J., Schluter D. 2007. The latitudinal gradient in recent speciation and extinction rates of birds and mammals. Science 315(5818): 1574. Wiens J.J. 2003a. Missing data, incomplete taxa, and phylogenetic accuracy. Syst. Biol. 52(4):528–538. Wiens J.J. 2003b. Incomplete taxa, incomplete characters, and phylogenetic accuracy: is there a missing data problem? J. Vertebrate Paleontol. 23(2):297–310. Wiens J.J., Morrill M.C. 2011. Missing data in phylogenetic analysis: reconciling results from simulations and empirical data. Syst. Biol. 60(5):719–731. Wolsan M., Sato J.J. 2010. Effects of data incompleteness on the relative performance of parsimony and Bayesian approaches in a supermatrix phylogenetic reconstruction of Mustelidae and Procyonidae (Carnivora). Cladistics 26(2):168–194. Yan C., Burleigh J.G., Eulenstein O. 2005. Identifying optimal incomplete phylogenetic data sets from sequence databases. Mol. Phylogenet. Evol. 35(2):528–535. Yano O., Katsuyama T., Tsubota H., Hoshino T. 2004. Molecular phylogeny of Japanese Eleocharis (Cyperaceae) based on ITS sequence data, and chromosomal evolution. J. Plant Res. 117:11. Yen A., Olmstead R. 2000. Molecular systematics of Cyperaceae tribe Cariceae based on two chloroplast DNA regions: ndhF and trnL intron-intergenic spacer. Syst. Bot. 25(3):479–494. Zhang X., Marchant A., Wilson K., Bruhl J.J. 2004. Phylogenetic relationships of Carpha and its relatives (Schoeneae, Cyperaceae) inferred from chloroplast trnL intron and trnL-trnF intergenic spacer sequences. Mol. Phylogenet. Evol. 31(2):647–657. Page: 219 205–219
© Copyright 2026 Paperzz