De Novo Genome Assembly of the Economically Important Weed Horseweed Using Integrated Data from Multiple Sequencing Platforms1[C][W][OPEN]

Yanhui Peng, Zhao Lai, Thomas Lane, Madhugiri Nageswara-Rao, Miki Okada, Marie Jasieniuk, Henriette O’Geen, Ryan W. Kim, R. Douglas Sammons, Loren H. Rieseberg, and C. Neal Stewart, Jr.*

Department of Plant Science, University of Tennessee, Knoxville, Tennessee 37996 (Y.P., T.L., M.N.-R., C.N.S.); Department of Biology, Indiana University, Bloomington, Indiana 47405 (Z.L., L.H.R.); Department of Plant Sciences (M.O., M.J.) and Genome Center (H.O., R.W.K.), University of California, Davis, California 95616; Monsanto, Inc., St. Louis, Missouri 63130 (R.D.S.); and Department of Botany, University of British Columbia, Vancouver, British Columbia, Canada V6T 1Z4 (L.H.R.)

Horseweed (Conyza canadensis), a member of the Compositae (Asteraceae) family, was the first broadleaf weed to evolve resistance to glyphosate. Horseweed, one of the most problematic weeds in the world, is a true diploid (2n = 2x = 18), with the smallest genome of any known agricultural weed (335 Mb). Thus, it is an appropriate candidate to help us understand the genetic and genomic bases of weediness. We undertook a draft de novo genome assembly of horseweed by combining data from multiple sequencing platforms (454 GS-FLX, Illumina HiSeq 2000, and PacBio RS), using libraries with different insert sizes (approximately 350 bp, 600 bp, 3 kb, and 10 kb) from a glyphosate-resistant horseweed biotype collected in Tennessee. From 116.3 Gb (approximately 350× coverage) of data, the genome was assembled into 13,966 scaffolds with a scaffold N50 (50% of the assembly) of 33,561 bp. The assembly covered 92.3% of the genome, including the complete chloroplast genome (approximately 153 kb) and a nearly complete mitochondrial genome (approximately 450 kb in 120 scaffolds). The nuclear genome contains 44,592 predicted protein-coding genes.
Genome resequencing of seven additional horseweed biotypes was performed. These sequence data were assembled and used to analyze genome variation. Simple sequence repeats and single-nucleotide polymorphisms were surveyed. Genomic patterns were detected that were associated with glyphosate-resistant or glyphosate-susceptible biotypes. The draft genome will be useful to better understand weediness and the evolution of herbicide resistance and to devise new management strategies. The genome will also be useful as another reference genome in the Compositae. To our knowledge, this article represents the first published draft genome of an agricultural weed.

1 This work was supported by the U.S. Department of Agriculture-National Institute of Food and Agriculture, Monsanto, the National Science Foundation (grant no. DBI–0820451), and the University of California, Davis, Genome Center.
* Address correspondence to [email protected].
The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: C. Neal Stewart, Jr. ([email protected]).
[C] Some figures in this article are displayed in color online but in black and white in the print edition.
[W] The online version of this article contains Web-only data.
[OPEN] Articles can be viewed online without a subscription.
www.plantphysiol.org/cgi/doi/10.1104/pp.114.247668
Plant Physiology®, November 2014, Vol. 166, pp. 1241–1254, www.plantphysiol.org © 2014 American Society of Plant Biologists. All Rights Reserved.
Downloaded from on June 17, 2017 - Published by www.plantphysiol.org Copyright © 2014 American Society of Plant Biologists. All rights reserved.

In the past few years, genomic approaches have revolutionized plant biology. Complete or draft genome data are currently available for tens of plant species (http://www.phytozome.net). From the model plant Arabidopsis (Arabidopsis thaliana; Arabidopsis Genome Initiative, 2000) to important crop plants such as rice (Oryza sativa; Goff et al., 2002; Yu et al., 2002; International Rice Genome Sequencing Project, 2005), soybean (Glycine max; Schmutz et al., 2010), maize (Zea mays; Schnable et al., 2009), and chickpea (Cicer arietinum; Varshney et al., 2013) to economic woody plants such as wine grape (Vitis vinifera; Jaillon et al., 2007) and poplar (Populus trichocarpa; Tuskan et al., 2006), powerful complete genome data sets and tools provide an unprecedented ability to explore the genetics and genomics of plant form and function. Furthermore, genome sequences of additional plant species will soon be available from in-progress large-scale sequencing projects (http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html).

One critical group of plants that has been largely overlooked in the genomics revolution consists of economically significant agricultural weeds (Stewart, 2009; Stewart et al., 2009). Weeds cause about $36 billion in damage annually in the United States alone (Pimentel et al., 2000). The cost is even higher if one includes weeds in pastures, golf courses, aquatic environments, etc. Unchecked, weeds effectively outcompete crops for resources and decrease the yield of food, feed, and fiber. Despite their profound economic significance, genomics data for weedy plants are very scarce relative to those for other economically important plants.

Weeds experienced surreptitious accidental domestication (Warwick and Stewart, 2005) and adaptation to changing agricultural environments to become among the world’s most successful plants. They are persistent, adaptable, stress tolerant, competitive, and capable of extreme levels of reproduction (e.g. some species can produce one million seeds per plant). Herbicide application has been the dominant weed-control measure in recent decades, but increasingly, the evolution of herbicide resistance is challenging this practice.
More than 400 weed biotypes have evolved resistance to herbicides from one or more of the major herbicide groups (Heap, 2014), and resistance to glyphosate is currently of greatest concern. Horseweed (Conyza canadensis), which is in the Compositae (Asteraceae) family, was the first eudicot weedy species to evolve glyphosate resistance and is also one of the most widely distributed glyphosate-resistant species in the world (VanGessel, 2001; Heap, 2014). Horseweed is a true diploid (2n = 18), with a self-fertilizing mating system and the smallest genome of any agricultural weed (335 Mb; Stewart et al., 2009). We sought to describe the reference genome of horseweed in order to enable genomic approaches to elucidate the mechanism of nontarget herbicide resistance and to gain a better understanding of the evolution and spread of herbicide resistance across agricultural landscapes.

Weed genomics is critical to understanding weed biology, and understanding weed biology is critical to weed management. To date, next-generation sequencing has been used to produce transcriptomes of a few weedy species to aid the discovery of genes that contribute to herbicide resistance and other weediness traits, with most progress occurring in Conyza and Amaranthus spp. and Lolium rigidum (Lee et al., 2009; Peng et al., 2010; Riggins et al., 2010; Lai et al., 2012; Gaines et al., 2014). Horseweed shares many other weediness features with the world’s most damaging weeds. These features include a relatively short vegetative phase, indeterminate flowering, self-compatibility, high seed output, long-distance seed dispersal, competitiveness with crop plants, a deep root system, discontinuous dormancy, environmental plasticity, and allelopathy (Basu et al., 2004; Chao et al., 2005). Also, horseweed is amenable to genetic transformation and regeneration (Halfhill et al., 2007), which facilitates overexpression or knockdown analysis of potential gene targets.
Being autogamous, a single horseweed plant may produce well over 200,000 seeds. The small seeds have pappi that enable wind and water dispersal (Bhowmik and Bekech, 1993; Weaver, 2001), sometimes over very long distances (DeVlaming and Proctor, 1968; Weaver and McWilliams, 1980; Andersen, 1993; Dauer et al., 2006; Shields et al., 2006). Moreover, horseweed seeds are not dormant and can germinate immediately after maturity in the fall, spring, or midsummer (Bhowmik and Bekech, 1993; Weaver, 2001). Thus, horseweed is a bona fide weed with model plant qualities.

The Compositae is the largest and most diverse plant family, with over 24,000 described species. Despite the evolutionary success and economic importance of plants in this family, a draft whole-genome publication for any member of the family or for any plants from closely related families is lacking. Genomes for lettuce (Lactuca sativa; https://lgr.genomecenter.ucdavis.edu), sunflower (Helianthus annuus; http://www.sunflowergenome.org), and several other members of the family are available, but de novo genome assembly is often arduous when genomes are large and complex (Kane et al., 2011; Truco et al., 2013). A draft sequence of a streamlined Compositae genome would greatly advance the genomics and genetics of economically, ecologically, and evolutionarily important plants within this family.

In this study, in order to further increase sequence resources for horseweed, we performed de novo genome sequencing and assembly of the Tennessee glyphosate-resistant horseweed biotype TN-R using multiple sequencing platforms, including 454 GS-FLX, Illumina HiSeq 2000, and PacBio RS. Using a modified strategy that has proven effective in vertebrates (Li et al., 2010a), date palm (Phoenix dactylifera; Al-Dous et al., 2011), and flax (Linum usitatissimum; Wang et al., 2012), the draft genome of horseweed was assembled into scaffolds.
The genomes and transcriptomes of this biotype and seven other horseweed biotypes were also sequenced, and the data were used to investigate genome variation and to facilitate molecular marker development and gene discovery. The primary goal of this project was to build substantial genomic resources for horseweed, which will subsequently be useful to elucidate the genomic basis of weediness traits.

RESULTS

De Novo Sequencing and Assembly

The reference genome source was DNA extracted from six individual plants of a single glyphosate-resistant biotype that was originally collected in 2002 from a soybean field in Jackson, TN (TN-R; Mueller et al., 2003). Comprehensive genome shotgun sequences were obtained using seven libraries with insert sizes that ranged from 350 bp to 10 kb and three sequencing platforms (454 GS-FLX, Illumina HiSeq 2000, and PacBio RS; Table I). Two plates of 454 GS-FLX runs on three independent libraries (approximately 600 bp) produced 2,238,807 raw reads with an average read length of approximately 390 bp, for a total yield of approximately 860 Mb. Two lanes of HiSeq 2000 runs (2 × 100 bp) of two short paired-end libraries produced 786,389,990 raw reads, for a total yield of approximately 77 Gb. One lane of a HiSeq 2000 run (2 × 100 bp) of one 3-kb mate-paired library produced 381,844,926 raw reads, for a total yield of approximately 37 Gb. Ten SMRT cells of PacBio RS runs produced 513,084 sequences with an average read length of 3.1 kb, which yielded 1.44 Gb of total data (Table II). All the sequence data together represented approximately 350× coverage of the horseweed genome.

Table I. Summary of sequencing read yield results from multiple platforms and mate-paired libraries of various sizes (approximately 350 bp, 600 bp, 3 kb, and 10 kb)

Platform       Libraries          Read Length    Filtered Reads    Data (Approximate Coverage)
454 GS-FLX     600 bp (3)         390 bp         2,238,807         860 Mb (2.5×)
Illumina PE    350 bp (2)         2 × 100 bp     786,389,990       77 Gb (230×)
Illumina MP    3 kb (1)           2 × 100 bp     381,844,926       37 Gb (110×)
PacBio RS      10 kb (1)          3.1 kb         513,084           1.4 Gb (4.5×)
Total          Seven libraries    0.1–10 kb      1,170,986,807     116.3 Gb (350×)

De novo genome assembly is often not straightforward (Baker, 2012), in that many factors affect the integrity and accuracy of the draft assembly. To integrate data from multiple sequencing platforms, we used a de novo assembly strategy (Supplemental Fig. S1). Factors that could affect assembly quality were tested, including k-mer size (a k-mer being any subsequence of length k from the reads obtained by DNA sequencing), sequencing depth and coverage, and the hybrid assembly of multiplex data (Supplemental Fig. S2). Increased coverage could improve assembly quality by increasing the N50 value (50% of the assembly) and the assembled genome size and may also reduce the number of contigs. However, once coverage reached 50×, the effect of this factor was limited. For a given coverage, a lower k-mer size translated into increased sensitivity but more assembly errors, while a higher k-mer size translated into a highly specific, accurate, and reliable assembly (Supplemental Fig. S2). Therefore, given adequate coverage, a higher k-mer size provided a better assembly. To mitigate potential problems, we chose deep coverage (greater than 50×), a high k-mer cutoff (more than 25 nucleotides) for accuracy, and multiple sequencing platforms and library types. Indeed, integrating data from multiple next-generation sequencing platforms and libraries with different insert sizes increased the quality of the de novo assembly (Supplemental Fig. S2).
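The approximate coverage values in Table I can be cross-checked by dividing each yield by the genome size. The sketch below assumes the 335-Mb genome size quoted in the text and the rounded yields from Table I, so its results need not match the published figures exactly; the function and variable names are illustrative only.

```python
# Approximate sequencing depth = total bases sequenced / genome size.
GENOME_SIZE_BP = 335e6  # horseweed genome size quoted in the text (335 Mb)

# Rounded yields from Table I, in base pairs.
yields_bp = {
    "454 GS-FLX": 860e6,
    "Illumina PE": 77e9,
    "Illumina MP": 37e9,
    "PacBio RS": 1.4e9,
}


def coverage(total_bases: float, genome_size: float = GENOME_SIZE_BP) -> float:
    """Return fold coverage for a given sequencing yield."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


per_platform = {name: coverage(bases) for name, bases in yields_bp.items()}
total_coverage = coverage(sum(yields_bp.values()))

for name, depth in per_platform.items():
    print(f"{name}: {depth:.1f}x")
print(f"Total: {total_coverage:.0f}x")
```

With these rounded inputs the total comes to roughly 347×, consistent with the approximately 350× reported.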
Newbler was used because it was designed specifically for assembling sequence data generated by the 454 GS-FLX (Margulies et al., 2005), whereas SOAPdenovo is among the most popular tools for assembling Illumina short reads (Li et al., 2010b). Using the PacBio RS sequencing platform, reads of over 10 kb in length were produced. However, their higher error rate prevented the direct use of PacBio reads in assembly (Supplemental Fig. S3). After error correction by mapping high-quality, but short, Illumina reads, the long PacBio reads increased the quality of the de novo genome assembly when used either for the primary assembly or to guide scaffolding (Supplemental Figs. S3–S5).

The filtered, high-quality 454 GS-FLX single reads and the high-coverage Illumina paired-end reads were combined to generate de novo contigs. Mate-pair libraries included true mates mixed with chimeras and paired ends. To find true mates with the correct orientation and distance, the de novo-assembled contigs were used as a reference for mapping the mate-pair reads (Supplemental Table S1; Supplemental Fig. S6). The total size of the assembled contigs was 311.3 Mb, with an N50 of 20,764 bp (5,122 contigs of that length or longer; Fig. 1; Table II). The contigs were further assembled into scaffolds totaling 344.9 Mb, and the N50 increased to 33,561 bp (3,575 scaffolds). Excluding gaps, the assembly covered 92.3% of the genome.

The horseweed genome had a G/C content of 34.9%, which was higher than that of sequenced woody eudicot genomes but equivalent to that of other eudicot genomes and much lower than that of monocot genomes (Fig. 1C; Supplemental Table S2). The G/C content of coding regions was 38.6%, which was higher than that of the entire genome (Fig. 1D). The assembled draft horseweed genome has been submitted to the National Center for Biotechnology Information with accession number SUB535309.
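The N50 statistic quoted above can be computed from a list of contig or scaffold lengths. The following is a minimal sketch of the usual definition (the length at which the cumulative sum of sequences, sorted from longest to shortest, first reaches the chosen fraction of the assembly). The function name and the toy lengths are illustrative assumptions, not the authors' pipeline or data.

```python
from typing import List, Tuple


def n_stat(lengths: List[int], fraction: float = 0.5) -> Tuple[int, int]:
    """Return (Nx length, number of sequences needed to reach the fraction).

    fraction=0.5 gives N50, 0.7 gives N70, and 0.9 gives N90.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    # The loop always returns because the full sum reaches any target <= total.
    raise AssertionError("unreachable")


# Toy lengths for illustration only (not horseweed data).
toy_lengths = [100, 80, 60, 40, 20]
n50_length, n50_count = n_stat(toy_lengths)
n90_length, n90_count = n_stat(toy_lengths, fraction=0.9)
print(f"N50 = {n50_length} bp in {n50_count} sequences")
print(f"N90 = {n90_length} bp in {n90_count} sequences")
```

Reporting the count alongside the length mirrors the "bp; n" convention used in Table II.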
To inspect the accuracy of the de novo assembly, multiple independent sources of horseweed DNA and complementary DNA sequences were aligned with the assembled genome (Table III). These sequences included EST reads from Sanger sequencing, transcriptome shotgun reads from 454 and Illumina sequencing, and partial genomic reads, which were used for the genome assembly (Zhou et al., 2009; Peng et al., 2010; Yuan et al., 2010; sample identifiers WSYE and NUSE from the OneThousand Plant Transcriptomes project [www.onekp.com]). Over 90% of sequences could be aligned to the genome. Mapping and BLAST results varied among sequence reads. However, among the aligned sequences, 91.2% or more had over 90% coverage and identity. The nonperfect alignments indicated that EST reads could be used to further improve assembly accuracy and gene prediction.

Table II. Summary of the de novo genome assembly of horseweed
N90 and N70, 90% and 70% of the assembly, respectively.

Parameter                       Contigs           Scaffolds
Sequence number                 20,075            13,966
N90 (bp; n)                     8,698; 14,599     13,506; 10,318
N70 (bp; n)                     14,517; 8,886     23,243; 6,219
N50 (bp; n)                     20,764; 5,122     33,561; 3,575
Average length (bp)             16,258            26,546
Maximum single sequence (bp)    102,072           182,395
Base pairs (Mb)                 311.27            344.88

Figure 1. De novo genome assembly of horseweed. A, Assembly units were sorted by descending length, and the cumulative DNA content was plotted as a function of the total number of assembly units. B, Length distributions of de novo-assembled contigs. C, G/C content distributions in horseweed computed over a bin size of approximately 500 bp. D, G/C content distribution of unique coding DNA sequences.

Identification of Repetitive Elements and Annotation of Protein-Coding Genes

Using the plant repeat database (RepeatMasker libraries 20140131) and RepeatMasker 4.0.5, a total of 233,521 loci (19.5 Mb) were identified as various nucleotide repeat elements. These sequences represented 6.25% of the assembled genome (Table IV) and were composed of 172,825 simple sequence repeats (SSRs), 28,494 low-complexity elements, 23,425 retroelements of various types, 5,424 small RNA structures, 2,283 DNA transposons, and 675 unclassified elements. Retroelements were the largest class of mobile elements found in the genome (2.61%), of which most were long terminal repeat-type retroelements (2.54%). Small RNAs contributed 0.61% to the genome, whereas identified DNA transposons represented just 0.16% of the genome.

Masked SSR loci were analyzed using the Simple Sequence Repeat Identification Tool (http://archive.gramene.org/db/markers/ssrtool) with a minimum threshold of six repeats. The results included 51,892 loci (Supplemental Table S3) containing 44,589 dimer (85.9%), 6,864 trimer (13.2%), 299 tetramer (0.58%), 60 pentamer (0.12%), and 80 hexamer (0.15%) motifs. The majority of these loci contained six to nine repeats (Supplemental Fig. S7). Considering the position in the scaffold, repeat number, and length, 1,316 loci were identified as having sequences appropriate for developing PCR primers, which could be used as SSR markers (Okada et al., 2013).

Using previous EST and transcriptome data as a starting point (Zhou et al., 2009; Peng et al., 2010; Yuan et al., 2010), 44,592 horseweed gene models were predicted. A total of 25,439 unique transcripts were returned after removing redundancies. The longest gene (using coding sequence only and without tallying introns) was 17,120 bp and was annotated to encode a midasin-like protein. The mean and median transcript lengths were 2,210 and 1,830 bp, respectively. Over 95% of the transcripts were longer than 715 bp (Fig. 2A).
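The SSR survey described above (dimer to hexamer motifs, minimum of six repeats) can be illustrated with a regular-expression scan. This is a minimal sketch of the general technique for perfect tandem repeats, not the algorithm of the Simple Sequence Repeat Identification Tool; the function names and the example sequence are hypothetical.

```python
import re
from typing import List, Tuple

MIN_REPEATS = 6               # minimum repeat threshold stated in the text
MOTIF_LENGTHS = range(2, 7)   # dimer through hexamer motifs


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit."""
    size = len(motif)
    return not any(
        size % unit == 0 and motif == motif[:unit] * (size // unit)
        for unit in range(1, size)
    )


def find_ssrs(sequence: str, min_repeats: int = MIN_REPEATS) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for each perfect tandem repeat."""
    sequence = sequence.upper()
    hits = []
    for motif_len in MOTIF_LENGTHS:
        # One motif copy followed by at least (min_repeats - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_repeats - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs such as "AA" or "ATAT" that reduce to a shorter unit.
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)


# Hypothetical sequence: seven AT repeats and six CAG repeats with short flanks.
example = "GGC" + "AT" * 7 + "CCT" + "CAG" * 6 + "TT"
for start, motif, count in find_ssrs(example):
    print(start, motif, count)
```

The primitive-motif check keeps a long dimer run from also being counted as a tetramer or hexamer locus, and it excludes homopolymer runs.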
The predicted gene sequences were further compared with the UniProt and National Center for Biotechnology Information nonredundant protein databases to assign biological information. Using a threshold e value of 10⁻⁴ or less, 79.6% of these sequences (20,248 of 25,439) had at least one hit to the database.

Table III. Alignment of ESTs, transcriptome sequencing (RNAseq), and genomic shotgun sequences to scaffolds
The threshold of alignment coverage and identity was 90% or greater.

Sequence Type                                  Sequences (n)    Aligned (n)    Aligned (%)    Alignment Coverage (%)    Alignment Identity (%)
Sanger ESTs (more than 600 bp)                 5,482            5,012          91.4           91.2                      93.7
454 GS-FLX RNAseq (more than 200 bp)           145,796          132,579        90.9           92.1                      92.6
Illumina RNAseq (100 bp)                       3,987,575        3,591,820      90.1           91.6                      94.5
454 GS-FLX genomic reads (more than 390 bp)    499,955          450,754        90.2           ≥95                       ≥99
Illumina genomic reads (100 bp)                9,315,014        8,476,173      91.0           ≥95                       ≥99

Table IV. Repeat element analysis in the horseweed genome
For elements, most repeats were fragmented by insertions or deletions and have been counted as one element.

Repeat Elements                                                 Elements (n)    Length (bp)    Percentage of Genome
Retroelements                                                   23,425          8,509,514      2.61
  Short interspersed elements                                   29              1,433          0.00
  Long interspersed elements                                    902             219,700        0.07
  Long terminal repeat elements                                 22,889          8,288,381      2.54
    Yeast transposon family1/Copia                              15,599          4,999,694      1.54
    Gypsy/Dictyostelium intermediate repeat sequence1           7,110           3,276,883      1.00
DNA transposons                                                 2,283           531,765        0.16
  Hobo-Activator                                                337             61,039         0.02
  Tc1-IS630-Pogo                                                719             166,937        0.05
  Enhancer/suppressor mutator                                   2               644            0.00
  Tourist/Harbinger                                             418             137,376        0.04
Unclassified                                                    675             278,725        0.09
Small RNAs                                                      5,424           2,003,521      0.61
Simple repeats                                                  172,825         7,669,927      2.35
Low complexity                                                  28,494          1,400,133      0.43
Total                                                           233,521         20,393,585     6.25
The majority of the alignments had an identity of 50% or greater (Fig. 2C). Within the annotated genes, 18,406 returned hits at e values of 10⁻¹⁰ or less. Moreover, most of the hits had only one high-scoring segment pair with the aligned genes (Fig. 2, B and D). Overall, the majority of the predicted genes in the assembled horseweed genome could be aligned with hits in a known protein database with high identity and specificity, which indicated that most of these genes were likely protein-coding sequences.

Gene Ontology (GO) annotation resulted in 14,897 GO-annotated genes. The largest molecular function category was kinase activity (1,780), followed by nucleotide binding (1,740), transporter activity (1,031), and another 15 subgroups with 50 or more terms (Supplemental Fig. S8A). Among genes annotated to various biological processes, the largest subgroup was stress-response genes (5,446), followed by cellular component organization-related genes (4,582), transport (3,799), and 27 additional groups with 50 or more terms (Supplemental Fig. S8B). The potential cellular localization of the genes was predicted by assigning 17,878 cellular component GO terms to various queries (Supplemental Fig. S8C). Genes that encoded proteins with enzyme activity were further divided into six groups (transferases, hydrolases, oxidoreductases, ligases, lyases, and isomerases; Supplemental Fig. S8D).

Fisher’s exact test showed 10 GO terms that were significantly overrepresented in horseweed compared with Arabidopsis (Fig. 3). The terms and related gene families identified by enrichment analysis included vacuolar transport, photosynthetic acclimation, detoxification of nitrogenous compounds, response to UV-B light, chloroplast thylakoid lumen, protein targeting to chloroplasts, protein peptidyl-prolyl isomerization, peptidyl-prolyl cis-trans-isomerase activity, hydrolase activity (acting on carbon-nitrogen), glycolysis, and NAD(P)H dehydrogenase complex assembly.
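GO-term overrepresentation of this kind is typically assessed on a 2 × 2 table: genes with and without a term in the test set versus the reference set. The following is a minimal sketch of a one-sided Fisher's exact test using only the standard library. It is not the enrichment software used in the study, and the counts in the example are hypothetical.

```python
from math import comb


def fisher_enrichment_p(test_with: int, test_without: int,
                        ref_with: int, ref_without: int) -> float:
    """One-sided Fisher's exact P value for overrepresentation in the test set.

    Sums the hypergeometric probabilities of observing at least `test_with`
    annotated genes in the test set, given the table margins.
    """
    counts = (test_with, test_without, ref_with, ref_without)
    if any(value < 0 for value in counts):
        raise ValueError("counts must be non-negative")
    n_test = test_with + test_without   # size of the test set
    n_term = test_with + ref_with       # genes carrying the GO term overall
    total = sum(counts)
    denominator = comb(total, n_test)
    upper = min(n_test, n_term)
    tail = sum(
        comb(n_term, k) * comb(total - n_term, n_test - k)
        for k in range(test_with, upper + 1)
    )
    return tail / denominator


# Hypothetical counts for a single GO term, for illustration only:
# 40 of 1,000 test genes versus 10 of 1,000 reference genes carry the term.
p_value = fisher_enrichment_p(40, 960, 10, 990)
print(f"one-sided P = {p_value:.3g}")
```

In a genome-wide screen, one such test is run per GO term, so a stringent cutoff (such as the P ≤ 0.001 used for Fig. 3) or a multiple-testing correction is normally applied.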
Four transporter subgroup GO terms were overrepresented in the horseweed genome at the P < 0.001 level: drug transmembrane transporters (GO:0015238), xanthine transmembrane transporters (GO:0042907), uracil transmembrane transporters (GO:0015210), and allantoin uptake transporters (GO:0005274; Supplemental Fig. S9). However, these data are not sufficient to draw conclusions about biological function and the evolution of herbicide resistance; further gene expression analysis and other follow-on experiments need to be performed.

Gene families that are commonly associated with nontarget herbicide resistance include cytochrome P450s, glutathione S-transferases (GSTs), glycosyltransferases (GTs), and ATP-binding cassette (ABC) transporters (Yuan et al., 2007). Relative to Arabidopsis, the horseweed genome has more members in each of these families (Table V). There were 323 unique cytochrome P450 genes at 401 loci in horseweed, which represents a 26% increase over Arabidopsis. Also, 155 ABC transporter genes at 213 loci were found in horseweed, which is 14% more than in Arabidopsis. Horseweed had 6% more GTs than Arabidopsis. There were 54 horseweed GSTs, which is one more gene than is found in Arabidopsis (Table V).

Plastid and Mitochondrial Genomes

Plastid genome sequences from 454 GS-FLX reads were isolated by searching against known plastid genome databases and then subjected to de novo assembly. Accurate Illumina reads were then mapped to the assembled plastid genomes to fix homopolymer errors. The entire chloroplast genome (approximately 153 kb; Supplemental Fig. S10) and most of the mitochondrial genome (approximately 450 kb in scaffolds) were obtained (Supplemental Table S4; Supplemental Data Sets S1 and S2). The horseweed chloroplast genome contained two inverted repeats (24,936 bp each), a large single-copy fragment (84,634 bp), and a small single-copy fragment (18,063 bp). The orientation of the two inverted repeats was determined by PCR.
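The reported chloroplast genome size of approximately 153 kb can be cross-checked by summing the four structural regions given above. This short sketch only restates that arithmetic; the constant names are illustrative.

```python
# Region lengths (bp) as reported in the text.
INVERTED_REPEAT_BP = 24_936     # each of the two inverted repeats
LARGE_SINGLE_COPY_BP = 84_634
SMALL_SINGLE_COPY_BP = 18_063

# Quadripartite structure: two inverted repeats plus the two single-copy regions.
chloroplast_size_bp = (
    2 * INVERTED_REPEAT_BP + LARGE_SINGLE_COPY_BP + SMALL_SINGLE_COPY_BP
)
print(f"Chloroplast genome: {chloroplast_size_bp:,} bp "
      f"(about {chloroplast_size_bp / 1000:.0f} kb)")
```

The sum is 152,569 bp, which agrees with the approximate figure of 153 kb.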
A total of 95 protein-coding genes (88 unique), 39 tRNA genes (28 unique), and eight ribosomal RNA genes (four unique) were annotated (Supplemental Data Set S1). These genes comprised 59.7% of the chloroplast genome. The remaining, noncoding portion of the chloroplast genome was composed of introns, intergenic spacers, and pseudogenes. The largest gene was ycf2 (6,085 bp), and the smallest was a 34-bp tRNA gene. The G/C content of the chloroplast genome was 37.2%. The chloroplast genome was nearly identical to those of lettuce and sunflower, except in the orientation of the two inverted repeat elements (Supplemental Fig. S11). In lettuce, the inverted repeat elements are in reverse orientation relative to those of horseweed and sunflower.

The horseweed mitochondrial genome reads from 454 GS-FLX, Illumina paired-end, and true-mate pairs were parsed and assembled into 123 scaffolds (N50 = 10,057 bp) with the help of plant mitochondrial databases. The scaffold sizes ranged from 315 to 43,498 bp, for a total of 453,334 bp in the mitochondrial genome (Supplemental Fig. S12; Supplemental Data Set S2), which was moderately sized compared with other plant mitochondrial genomes.

Figure 2. Annotation of protein-coding transcripts. A, Distribution of lengths of predicted transcripts. B, Identity distribution of alignments with hits in the database. C, E value distribution of aligned hits. D, Distribution of high-scoring segment pairs (HSPs) from aligned hits.

Genome Resequencing and Genomic Variation Analysis

We resequenced four population pairs that were either glyphosate resistant (R) or susceptible (S) from the United States.
The representative populations used for resequencing were from California (CA-R versus CA-S), Delaware (DE-R versus DE-S), Indiana (IN-R versus IN-S), and Tennessee (TN-R versus TN-S). Individual seedlings surviving a glyphosate application were used as DNA donors, and their DNA was pooled for sequencing (Table VI; Supplemental Fig. S13). Within two HiSeq 2000 flow cell lanes, a total of 55.6 Gb (170× coverage) of paired-end reads was produced, and each of the seven additional genomes was assembled using TN-R as the reference genome (Table VI). Interpopulation genome variation included single-nucleotide variation, multiple-nucleotide variation (defined as 4 bp or less, with equal numbers of nucleotides at each locus), insertions/deletions, and replacements (Table VII). When a minimum cutoff (20× or greater coverage at the variant locus and 20% or greater frequency) was applied to avoid calling errors, the frequency of detected variants in the genome varied among biotypes, ranging from 0.75 to 1.59 counts per kb (mean of 1.01; Table VII).

Figure 3. Differential GO term distribution by enrichment analysis of the horseweed and Arabidopsis genomes using Fisher’s exact test. The entire transcriptome in horseweed was set as the test data set, and the entire transcriptome in Arabidopsis was set as the reference data set (P ≤ 0.001). [See online article for color version of this figure.]

Compared with the TN-R reference genome, CA-R had the fewest variants, followed by, in ascending order, IN-R, CA-S, IN-S, DE-R, TN-S, and DE-S. IN-R, IN-S, and TN-S had fewer variants relative to each other than relative to the remaining biotypes. Compared with their partners (DE-S and CA-S, respectively), DE-R and CA-R had fewer variants.
Similarly, IN-R had fewer variants than IN-S (Supplemental Table S5). The minimum average quality score for the position of each variant was 30 (one error in 1,000; Fig. 4A), which ensured that the results were quite reliable. To determine whether a variant was homozygous, we applied a threshold of 90% probability and found that IN-R and CA-R had the highest frequency of heterozygosity (nearly 80%), whereas only 50% of variants were heterozygous in the DE-R, DE-S, and TN-S biotypes (Fig. 4B). Most of the sequence variation among samples was composed of single-nucleotide variations/single-nucleotide polymorphisms (SNPs) in all eight biotypes (Fig. 4C).

Fisher’s exact test showed that a total of 2,010 specific variants were found in all four glyphosate-resistant biotypes, which were distributed among 1,370 loci. Only 539 specific variants were found in all four glyphosate-susceptible biotypes, and these were located at 425 loci (Fig. 4D). Because a physical map of the horseweed genome was not available, the location of each specific variant is listed by its position in the sorted scaffold arrays rather than in the genome (Supplemental Fig. S14). A genome-wide association study requires SNP information from the genomes of many control and test individuals. Specific SNPs in glyphosate-resistant and glyphosate-susceptible genomes were located in hotspots, which were defined as those with low P values (P ≤ 0.01), and might be associated with the glyphosate resistance phenotype (Supplemental Fig. S14). However, these findings are not definitive, given the small number of biotypes analyzed.

Phylogenetic analysis of whole-genome SNP variation provided a phylogeographic hypothesis regarding the evolution of glyphosate resistance and its spread (Fig. 5). Biotypes from proximate locations generally shared the same clades, such as the Delaware pairs and the Indiana pairs, which is consistent with our previous studies (Yuan et al., 2010; Okada et al., 2013). Although not in the same clade, the TN-R branch was proximate to TN-S. The position of CA-S on the phylogram was basal, which might be of biological importance with regard to gene flow in the species. The phylogenetic pattern suggests that glyphosate resistance in horseweed has evolved independently multiple times (Fig. 5).

Table V. Analysis of gene families (cytochrome P450s, GSTs, ABC transporters, and GTs) that are potentially involved in nontarget glyphosate resistance in horseweed compared with those tallied for Arabidopsis
Loci data refer to the total potential copy numbers in the gene family.

Gene Family          Arabidopsis    Horseweed
Cytochrome P450s     256            323 (401 loci)
GSTs                 53             54 (54 loci)
ABC transporters     136            155 (213 loci)
GTs                  361            381 (457 loci)

Table VI. Summary of genome resequencing of seven additional horseweed biotypes using the Illumina HiSeq 2000 platform
R, Resistant; S, susceptible.

Population Identifier    Accession Location    Phenotype    Coverage
CA-S                     Fresno, CA            S            36×
CA-R                     Fresno, CA            R            19×
DE-S                     Georgetown, DE        S            24×
DE-R                     Georgetown, DE        R            28×
IN-S                     Knox County, IN       S            16×
IN-R                     Knox County, IN       R            16×
TN-S                     Jackson, TN           S            30×

5-Enolpyruvylshikimate-3-Phosphate Synthase Genes

In the horseweed genome, there are two 5-enolpyruvylshikimate-3-phosphate synthase (EPSPS) gene copies (EPSPS1 and EPSPS2); thus, no gene amplification was observed. Eleven EPSPS1 and 32 EPSPS2 variants were detected (Supplemental Fig. S15). However, no EPSPS mutation cosegregated with glyphosate resistance in our data. Furthermore, none of the observed EPSPS sequence variants have been previously reported to be associated with resistance. There also were no significant differences in EPSPS gene expression level among these eight biotypes in response to glyphosate treatment, based on an RNAseq study (Y. Peng, Y.
Sang, S. Allen, and C.N. Stewart, unpublished data). Therefore, we conclude that glyphosate resistance in horseweed is conferred by a non-target-site mechanism.

DISCUSSION

To our knowledge, horseweed is the first economically important weed to have its genome sequenced and assembled. The de novo strategy using whole-genome shotgun sequencing in this study was similar to those used in several recent reports (Al-Dous et al., 2011; Wang et al., 2012; Varshney et al., 2013), except that we also included third-generation sequencing reads in our study. The assembly, with 13,996 scaffolds covering 92.3% of the genome, should be considered a draft genome comparable to those of rice, soybean, maize, chickpea, grape, poplar, and date palm (Goff et al., 2002; Yu et al., 2002; Tuskan et al., 2006; Jaillon et al., 2007; Schnable et al., 2009; Schmutz et al., 2010; Al-Dous et al., 2011; Varshney et al., 2013). Further improvements in assembly accuracy and backfilling will require the use of either long-insertion mate-pair reads or physical maps (International Rice Genome Sequencing Project, 2005; Wang et al., 2012; Al-Mssallem et al., 2013). The draft horseweed genome has proven to be immediately useful. We have successfully cloned 10 target promoters of interest to study glyphosate resistance and response (data not shown). Moreover, 12 SSR loci have already been tested and successfully used in a recent horseweed population genetics study (Okada et al., 2013). Also, the genome sequences have been used to annotate the explosive composition B (hexahydro-1,3,5-trinitro-1,3,5-triazine and 2,4,6-trinitrotoluene) response genes in Baccharis halimifolia (Ali et al., 2014). Furthermore, we observed that the horseweed chloroplast genome is nearly identical to those of sunflower and lettuce (https://lgr.genomecenter.ucdavis.edu). Higher plants have various mitochondrial genome architectures with respect to size and construction compared with the chloroplast genome, which is more conserved (Tian et al., 2006).
Brassica carinata has a small mitochondrial genome (232,241 bp), whereas Cucurbita pepo has a much larger mitochondrial genome (982,832 bp; Supplemental Table S4).

Table VII. Summary of genomic variation among the sequenced horseweed biotypes
For each biotype, genomic DNA was isolated from six individual plants. Variants having frequencies of 90% or greater were considered to be homozygous, whereas variants having frequencies between 20% and 90% were considered to be heterozygous. MNV, Multiple-nucleotide variation; SNV, single-nucleotide variation. *Value of average frequency.

Biotype  MNV     SNV        Deletion  Insertion  Replacement  Heterozygous  Homozygous  Total No.  Total Size (bp)  kb per Variant
CA-R     7,069   188,525    35,475    3,664      621          184,596       50,758      235,354    247,452          1.32
CA-S     13,253  279,016    10,329    5,407      1,035        235,416       73,624      309,040    329,025          0.99
DE-R     11,215  352,016    33,265    10,712     1,210        228,343       180,075     408,418    432,327          0.76
DE-S     13,024  419,703    41,670    11,843     1,255        234,044       253,451     487,495    516,307          0.63
IN-R     9,890   273,165    16,479    6,049      832          248,044       58,371      306,415    323,209          1.01
IN-S     10,394  298,866    20,132    7,360      965          270,989       66,728      337,717    356,007          0.92
TN-R     6,650   174,915    48,417    3,646      617          155,282       78,963      234,245    246,577          1.33
TN-S     9,725   330,880    58,920    10,432     1,043        205,585       205,335     410,920    433,873          0.75
Total    81,220  2,317,086  264,687   59,113     7,578        1,762,299     967,305     2,729,604  2,884,777        0.96*

Figure 4. Summary of genome variants and distribution among biotypes. A, Total variant counts in different biotypes. B, Percentage of heterozygous and homozygous variants among biotypes. C, Distribution of variant types. Del, Deletion; Ins, insertion; MNV, multiple-nucleotide variation; Rep, replacement; SNV, single-nucleotide variation. D, Number of specific variants in glyphosate-resistant (GR) and glyphosate-susceptible (GS) populations. [See online article for color version of this figure.]

Next-generation sequencing technologies have been widely used in whole-genome sequencing and resequencing, which has led to the development of rapid genome-wide SNP detection applications in model and crop plant species for exploring within-species diversity, genotyping by sequencing, construction of haplotype maps, and performing genome-wide association studies of interesting traits (Craig et al., 2008; Huang et al., 2009, 2010; Atwell et al., 2010; Todesco et al., 2010; Wu et al., 2010; Kump et al., 2011; Deschamps et al., 2012). With the reported horseweed genome data, 51,892 SSR markers and over 2.7 million SNPs, as well as chloroplast DNA markers, were identified. Although glyphosate-susceptible and glyphosate-resistant plants had opposite phenotypic responses to glyphosate treatment, 99.93% of their genomes were identical. We found many SNPs in glyphosate-resistant or glyphosate-susceptible genomes that were located in hotspots, which were defined as those with low P values (P ≤ 0.01), and might be associated with glyphosate resistance. However, many SNPs will be silent at the protein sequence level and not of interest for further study in terms of biological functions. Other mutations involved in gene expression regulation (e.g. within promoter regions) are of potential interest for elucidating resistance and other traits. Interestingly, horseweed exhibited considerable morphological variability among accessions (Supplemental Fig. S13); thus, integrating genomic information with resistance phenomena could be useful in weed science not only in studying the rapid evolution of herbicide resistance but also other traits. The evolution of glyphosate resistance is a critical process that, in most cases, has been elucidated with regard to genomics and molecular mechanisms.
For over a decade, we have known that glyphosate resistance in horseweed is a semidominant trait that is determined by a single simple Mendelian locus (Zelaya et al., 2004). The major focus of weed scientists to combat resistance has been to explore more herbicide control strategies while essentially ignoring the genomic and evolutionary bases of resistance. With few genomic resources for weeds and little expertise to utilize the available resources among the weed science community (Stewart, 2009), discovering the molecular mechanisms underlying nontarget resistance has been extremely difficult. However, now that the genomics era has found its way to weed science, we can begin to answer fundamental questions about what makes weeds so weedy and capable of adapting to control measures and, thereby, design approaches that reduce the further evolution of herbicide resistance in weeds (Basu et al., 2004; Stewart et al., 2009).

Figure 5. Phylogenetic tree of sequenced horseweed biotypes based on whole-genome single-nucleotide variation profiles using the shrunk-genomes method of the program PhyloSNP (Faison et al., 2014) with a position delta of zero. Bootstrap supports were all 100%, from 1,000 iterations. The tree was built by using the neighbor-joining method in Phylip 3.695. The scale bar indicates the number of genetic changes per unit of length.

The currently known mechanisms of glyphosate resistance in weedy plants include EPSPS target-site mutations (Baerson et al., 2002; Collavo and Sattin, 2012; Sammons and Gaines, 2014), EPSPS gene amplification (Gaines et al., 2010, 2013; Jugulam et al., 2014), and vacuolar sequestration of glyphosate (Ge et al., 2010, 2012, 2014); however, in the last case, the genes conferring resistance have not been identified.
Non-target-site resistance, especially via altered sequestration, is interesting both biologically and practically for weed management, but is not well characterized at the genomic or molecular level for most weeds (Yuan et al., 2007). This is the case with the many biotypes of glyphosate-resistant horseweed as well as several other glyphosate-resistant weedy plant species, such as johnsongrass (Sorghum halepense), ryegrass (Lolium spp.), velvet bean (Mucuna pruriens), and wild radish (Raphanus raphanistrum; Mueller et al., 2003; Feng et al., 2004; Main et al., 2004; Zelaya et al., 2004; Koger and Reddy, 2005; Owen and Zelaya, 2005; Preston and Wakelin, 2008; Ge et al., 2010, 2012; Riar et al., 2011; Rojano-Delgado et al., 2012; Vila-Aiub et al., 2012; Ashworth et al., 2014; Sammons and Gaines, 2014). Physiologically, we know that non-target-site resistance can involve a plethora of metabolic, conversion, sequestration, and reduced translocation processes, including oxidation, conjugation, or compartmentation of the herbicide molecules (Yuan et al., 2007; Cummins et al., 2013; Iwakami et al., 2014). Horseweed represents a model weedy plant with a non-target-site glyphosate resistance mechanism via altered translocation or transport (Feng et al., 2004; Stewart et al., 2009; Ge et al., 2010). An important goal of effective weed management is to stop or slow the evolution and spread of herbicide resistance in weeds. Thus, it is vital to understand whether glyphosate resistance originated once and spread from a single source horseweed population or originated multiple independent times within distinct populations. If resistance originated once and spread, say, via seeds, then the molecular mechanisms would be expected to be the same or similar among populations because of identical descent. In that case, the prevention of seed dispersal and movement by machinery or other means would be an effective resistance management strategy.
Alternatively, if resistance originated multiple times and spread from multiple sources, reduction in both seed dispersal and selection pressure will be needed. Management would also benefit from understanding evolutionary dynamics. In the case of horseweed, the evolution of glyphosate resistance occurred independently in multiple locations and might be caused by more than one nontarget resistance gene (Yuan et al., 2010; Okada et al., 2013; this study). Moreover, approximate Bayesian computation (ABC) analyses of microsatellite marker variation indicated that resistant populations underwent expansion after greatly increased glyphosate use in California in the 1990s, but many years before it was detected, strongly suggesting that diversity in weed-control practices prior to herbicide regulation probably kept resistance frequencies low (Okada et al., 2013). Data from this study also seem not to support the long-range rapid spread of resistance from a single source. At one time, the application of a combination of herbicides simultaneously or in sequence was considered to be an effective resistance management strategy. However, resistance to multiple herbicides is becoming increasingly common (Ashworth et al., 2014; Heap, 2014), and this approach is no longer consistently effective. Herbicides with new modes of action or more diversified control measures are desperately needed. Understanding the genomic mechanisms underlying nontarget resistance would allow the development of novel approaches, such as allelopathy and the expanded use of safeners. For the first time, we have genomics data to begin to address the genetic basis of weediness. Do weeds have unique genes that endow weediness or merely variants of genes common to all plants? Most researchers have posited that the latter situation is most plausible (Basu et al., 2004; Stewart et al., 2009). 
Indeed, our finding lends credence to the gene family expansion hypothesis, given that nontarget resistance candidate gene families (ABC transporters and cytochrome P450 genes and, to a lesser extent, GST and GT genes) were overrepresented in horseweed relative to other published plant genomes. Weeds, at least some of them, might have an inordinate adeptness for rapid evolution that is manifested by herbicide resistance; gene family expansion and/or changes in gene expression might be why weeds are weeds (Yuan et al., 2007; Shaner, 2009; Ge et al., 2010; Peng et al., 2010; Gaines et al., 2014; Sammons and Gaines, 2014). Other genes of interest with enriched GO terms, such as vacuole transport, photosynthetic acclimation, and detoxification of nitrogenous compounds, were found (Fig. 3). Finally, horseweed seems to have relatively more protein-coding genes (44,592; also supported by EST data) in comparison with other sequenced plant genomes while maintaining the smallest genome of all economically important weeds (Sterck et al., 2007; Stewart et al., 2009; Paterson et al., 2010). However, de novo genome assembly (contigs or scaffolds) tends to underestimate repeat numbers and overestimate gene numbers, because repeats are collapsed onto each other and fragmented genes are, in some cases, annotated as two unique genes (Chaisson and Pevzner, 2008; Baker, 2012; Seabury et al., 2013). Repeat numbers in the horseweed genome are lower than in most other sequenced plant genomes, in part because horseweed has a relatively compact genome. However, read-mapping analyses revealed that the read depth of repeated regions was much higher than for nonrepeats.
This finding implies that repeat numbers are underestimated in the horseweed genome and that a de novo repeat-searching model will be needed for further analysis. The small genome size of horseweed and other economically important agricultural weeds might have a practical and evolutionary significance. Twenty-three of the 25 most-studied weedy plant species have 1C genome sizes of less than 5,000 Mb, and 13 of these have genome sizes of less than 2,000 Mb (Stewart et al., 2009). Recently, a meta-analysis was performed in which invasiveness among plant species was negatively associated with genome size and positively associated with chromosome number (Pandit et al., 2014). This study analyzed 890 species among 62 genera. Even though it represents one datum, horseweed appears to be, potentially, an archetypical weed in that it has a very streamlined genome but with a large number of genes, which could give it the capacity for rapid evolution in changing environments. Thus, beyond its use for practical agricultural research, the horseweed genome might shed light on standing questions of invasion, evolution, and changing climates (Caplat et al., 2013). The bane of farmers could be the harbinger of genomic enlightenment.

MATERIALS AND METHODS

Plants and Phenotyping

Seeds sampled from horseweed (Conyza canadensis) populations were germinated and grown in potting medium (3B potting medium, 10-cm-diameter pot, one plant per pot; Supplemental Fig. S13) in a greenhouse under a 16-/8-h light/dark photoperiod at ambient temperatures (25°C ± 2°C). Twenty-four 3-month-old plants from each population were treated with glyphosate at the field rate (0.84 kg ha⁻¹; Roundup Weathermax; Monsanto). Treatment occurred at the rosette stage when plants were 6 to 8 cm in diameter, and glyphosate-treated plants were considered to be resistant if they were alive at the end of 3 weeks (Yuan et al., 2010).
DNA Isolation and Genome Sequencing Using Multiple Platforms

Genomic DNA was extracted from six individual plants of the glyphosate-resistant biotype that was collected originally from a soybean field in Jackson, TN (TN-R), using Plant DNeasy kits (Qiagen). Whole-genome shotgun sequencing was performed using the Illumina HiSeq 2000, 454 GS-FLX Titanium, and PacBio RS sequencing platforms. A total of three sequencing libraries were constructed with insert sizes of approximately 600 bp for the 454 GS-FLX Titanium system. Two paired-end sequencing libraries with insert sizes of approximately 350 bp and one mate-pair library with insert sizes of approximately 3,000 bp were constructed for the Illumina HiSeq 2000 system. One library with insert sizes of approximately 10,000 bp was constructed for the PacBio RS sequencing system.

De Novo Genome Assembly

The de novo genome assembly strategy is shown in Supplemental Figure S1. The 454 GS-FLX reads were assembled with Newbler. The Illumina paired-end sequence reads were divided into subsets by sequencing depth (approximately 50×) using error-corrected PacBio long reads as guidance (Koren et al., 2012), and the subsets were assembled using SOAPdenovo and CLC Genomic Workbench. NGen was used to combine the primary assembled contigs. CLC Genomic Workbench was used to find the true mate pairs by mapping with the total contigs. NGen was used for the final scaffold construction and gap closure. All the parameter settings were described previously (Varshney et al., 2012; Wang et al., 2012). To check the completeness of the assembly, we mapped ESTs and transcriptome data to the genome assembly using BLASTN (with e value cutoff of 10⁻²⁰).

Assembly of Chloroplast and Mitochondrial Genomes

The sequencing reads were screened against custom plant chloroplast and mitochondrial genome databases. The subset sequences for chloroplast and mitochondria were assembled separately. The complete chloroplast genome was annotated using DOGMA (Wyman et al., 2004).
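The e value screening used for the completeness check above can be illustrated with a minimal sketch. It assumes the BLASTN hits were exported in the standard 12-column tabular format (query ID in column 1, e value in column 11); the source does not state which output format was used, and the function names and example records below are hypothetical.

```python
def mapped_queries(blast_tab_lines, max_evalue=1e-20):
    """Return the IDs of queries with at least one hit at or below max_evalue.

    Assumes BLAST tabular output: 12 tab-separated columns, with the query ID
    in column 1 and the e value in column 11 (an assumption, not from the paper).
    """
    kept = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            kept.add(query_id)
    return kept


def fraction_mapped(blast_tab_lines, all_query_ids, max_evalue=1e-20):
    """Fraction of all query sequences (e.g. ESTs) that pass the e value cutoff."""
    all_ids = set(all_query_ids)
    if not all_ids:
        return 0.0
    return len(mapped_queries(blast_tab_lines, max_evalue) & all_ids) / len(all_ids)


# Hypothetical records: est1 passes the 1e-20 cutoff, est2 does not.
example_hits = [
    "est1\tscaf1\t99.0\t500\t5\t0\t1\t500\t100\t599\t1e-150\t900",
    "est2\tscaf2\t85.0\t60\t9\t0\t1\t60\t10\t69\t1e-05\t50",
]
print(fraction_mapped(example_hits, ["est1", "est2", "est3", "est4"]))
```

The same filter with a looser cutoff (for example `max_evalue=1e-5`) corresponds to the kind of screening applied to organelle contigs and protein alignments elsewhere in these methods.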
To remove nonmitochondrial assembly, the mitochondrial contigs were further screened against plant mitochondrial genome databases using BLASTN with e value cutoff of 10⁻⁵.

Gene and Repeat Annotation

Gene prediction was performed using homology-based and transcript-based methods. Previous horseweed ESTs and transcriptome data were aligned to the genome assembly using BLAT (blat-34; identity ≥ 0.98, coverage ≥ 0.98) to generate spliced alignments (Kent, 2002), which were linked according to their overlap using PASA (Haas et al., 2003). Plant proteins of other species were also mapped to the genome using TBLASTN with e value cutoff of 10⁻⁵. Predicted transcripts were assigned biological information by searching the UniProt protein database using BLASTX with e value cutoff of 10⁻³. RepeatMasker version 4.0.5 (http://www.repeatmasker.org/) was used to search putative transposable element regions of the horseweed genome against an updated set of RepeatMasker libraries (20140131). To develop SSR markers for further population genetic studies, the masked SSR loci were further analyzed using the Simple Sequence Repeat Identification Tool (http://archive.gramene.org/db/markers/ssrtool) with a threshold minimum of six repeats.

GO Annotation and Enrichment Analysis

Results of BLASTX of horseweed transcripts were further mapped using the GO database to assign GO terms and carry out GO annotation using the Blast2Go program integrated in CLC Genomic Workbench (http://www.blast2go.com). Enrichment analysis was performed by using the entire set of available Arabidopsis (Arabidopsis thaliana) transcripts as a reference. The P value of Fisher’s exact test was set to P ≤ 0.001 to reduce noise and output the most specific terms.

Genome Resequencing and Variation Analysis

DNA samples of another seven horseweed biotypes were isolated using the method described above.
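The coverage and frequency cutoffs used for variant calling in this study (at least 20 times coverage and 20% frequency for a variant to be reported, and 90% or greater frequency for it to be considered homozygous, as stated with Table VII) can be illustrated with a short sketch. The function names and the example records are hypothetical; the variants themselves were called with the probabilistic model in CLC Genomic Workbench, which this sketch does not reproduce.

```python
from collections import Counter


def classify_variant(coverage, frequency,
                     min_coverage=20, min_frequency=0.20, hom_frequency=0.90):
    """Apply the reporting cutoffs and the zygosity threshold to one variant.

    Returns None when the variant fails the minimum coverage or frequency
    cutoff, "homozygous" at 90% frequency or greater, and "heterozygous"
    otherwise (20% up to, but not including, 90%).
    """
    if coverage < min_coverage or frequency < min_frequency:
        return None
    return "homozygous" if frequency >= hom_frequency else "heterozygous"


def tally_zygosity(variants):
    """Count reported variants by zygosity; variants are (coverage, frequency) pairs."""
    calls = (classify_variant(cov, freq) for cov, freq in variants)
    return Counter(call for call in calls if call is not None)


# Hypothetical (coverage, frequency) records for illustration only.
example_variants = [(25, 0.95), (25, 0.50), (10, 0.95), (40, 0.10), (20, 0.20)]
print(tally_zygosity(example_variants))
```

Dividing the heterozygous count by the total of reported variants gives the kind of per-biotype heterozygosity percentage summarized in Figure 4B.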
Paired-end DNA libraries (approximately 350-bp insertion) were constructed for each biotype, bar coded, and subjected to Illumina sequencing using the HiSeq 2000 platform. After trimming and quality-control steps, the remaining data for each population were assembled into genomes with the TN-R assembly as reference using CLC Genomic Workbench 7.03. The probabilistic variant detection model was chosen to report the variants. A minimum cutoff value (20 times or greater coverage at the variant loci, 20% or greater frequency) was applied to avoid issues of sequencing and assembly errors in variant calling. Fisher’s exact test was carried out to detect specific variants in glyphosate-susceptible or glyphosate-resistant groups. The cutoff sample frequency was 90% or greater, which meant that only variants present in all test samples, but not in any reference samples, would be reported. To carry out phylogenetic analysis based on whole-genome single-nucleotide variation profiles among sequenced horseweed biotypes, the shrunk-genomes method of the program PhyloSNP (Faison et al., 2014) with a position delta of zero surrounding each SNP was chosen. The algorithm of a quantitative method was used, in which the created matrix of presence/absence and the position of each SNP from the shrunk-genome alignment can be used directly to generate a distance matrix among the biotypes. Finally, the tree was built by using the neighbor-joining method in Phylip 3.695.

Supplemental Data

The following materials are available in the online version of this article.
Supplemental Figure S1. Data flow of de novo genome assembly.
Supplemental Figure S2. Factors effect on the quality of de novo genome assembly.
Supplemental Figure S3. Length and quality distribution of PacBio reads.
Supplemental Figure S4. PacBio reads rescue.
Supplemental Figure S5. The benefit of including PacBio long reads.
Supplemental Figure S6. True and false mate-paired reads.
Supplemental Figure S7. Repeats distribution of all SSR loci.
Supplemental Figure S8. GO annotation.
Supplemental Figure S9. Enriched GO terms in transporter subgroup.
Supplemental Figure S10. Map of horseweed chloroplast genome.
Supplemental Figure S11. Syntenic analysis of the horseweed chloroplast genome.
Supplemental Figure S12. Horseweed mitochondrial genome.
Supplemental Figure S13. Morphological variation among horseweed populations.
Supplemental Figure S14. Distribution of specific SNPs.
Supplemental Figure S15. Alignment of EPSPS protein sequences from different horseweed biotypes.
Supplemental Table S1. Mapping of mate-paired reads.
Supplemental Table S2. GC content of plant genomes.
Supplemental Table S3. Summary of SSR markers.
Supplemental Table S4. Summary of plant mitochondrial genomes.
Supplemental Table S5. Variants within genomes among horseweed biotypes.
Supplemental Data Set S1. Annotation of horseweed chloroplast genome.
Supplemental Data Set S2. Horseweed mitochondrial genome sequences.

ACKNOWLEDGMENTS

We thank Sara Allen, Laura Abercrombie, Qidong Jia, Guanglin Li, and Nolan Kane for assistance with material sampling, sequencing, and data collection. We also thank the providers of the original sources of germplasm, including Anil Shrestha, David Grantz, Vince Davis, Robert M. Hayes, and Barbara Scott.

Received July 29, 2014; accepted September 9, 2014; published September 10, 2014.

LITERATURE CITED

Al-Dous EK, George B, Al-Mahmoud ME, Al-Jaber MY, Wang H, Salameh YM, Al-Azwani EK, Chaluvadi S, Pontaroli AC, DeBarry J, et al (2011) De novo genome sequencing and comparative genomics of date palm (Phoenix dactylifera).
Nat Biotechnol 29: 521–527 Ali A, Zinnert JC, Muthukumar B, Peng Y, Chung SM, Stewart CN Jr (2014) Physiological and transcriptional responses of Baccharis halimifolia to the explosive “composition B” (RDX/TNT) in amended soil. Environ Sci Pollut Res Int 21: 8261–8270 Al-Mssallem IS, Hu S, Zhang X, Lin Q, Liu W, Tan J, Yu X, Liu J, Pan L, Zhang T, et al (2013) Genome sequence of the date palm Phoenix dactylifera L. Nat Commun 4: 2274 Andersen MC (1993) Diaspore morphology and seed dispersal in several wind-dispersed Asteraceae. Am J Bot 80: 487–492 Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796–815 Ashworth MB, Walsh MJ, Flower KC, Powles SB (2014) Identification of the first glyphosate-resistant wild radish (Raphanus raphanistrum L.) populations. Pest Manag Sci 70: 1432–1436 Atwell S, Huang YS, Vilhjálmsson BJ, Willems G, Horton M, Li Y, Meng D, Platt A, Tarone AM, Hu TT, et al (2010) Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465: 627–631 Baerson SR, Rodriguez DJ, Tran M, Feng Y, Biest NA, Dill GM (2002) Glyphosate-resistant goosegrass: identification of a mutation in the target enzyme 5-enolpyruvylshikimate-3-phosphate synthase. Plant Physiol 129: 1265–1275 Baker M (2012) De novo genome assembly: what every biologist should know. Nat Methods 9: 333–337 Basu C, Halfhill MD, Mueller TC, Stewart CN Jr (2004) Weed genomics: new tools to understand weed biology. Trends Plant Sci 9: 391–398 Bhowmik PC, Bekech MM (1993) Horseweed (Conyza canadensis) seed production, emergence, and distribution in no-tillage and conventional tillage corn (Zea mays). Agron Trends Agric Sci 1: 67–71 Caplat P, Cheptou PO, Diez J, Guisan A, Larson BMH, MacDougall AS, Peltzer DA, Richardson DM, Shea K, van Kleunen M, et al (2013) Movement, impacts and management of plant distributions in response to climate change: insights from invasions. 
Oikos 122: 1265–1274 Chaisson MJ, Pevzner PA (2008) Short read fragment assembly of bacterial genomes. Genome Res 18: 324–330 Chao WS, Horvath DP, Anderson JV, Foley MP (2005) Potential model weeds to study genomics, ecology, and physiology in the 21st century. Weed Sci 53: 929–937 Collavo A, Sattin M (2012) Resistance to glyphosate in Lolium rigidum selected in Italian perennial crops: bioevaluation, management and molecular bases of target-site resistance. Weed Res 52: 16–24 Craig DW, Pearson JV, Szelinger S, Sekar A, Redman M, Corneveaux JJ, Pawlowski TL, Laub T, Nunn G, Stephan DA, et al (2008) Identification of genetic variants using bar-coded multiplexed sequencing. Nat Methods 5: 887–893 Cummins I, Wortley DJ, Sabbadin F, He Z, Coxon CR, Straker HE, Sellars JD, Knight K, Edwards L, Hughes D, et al (2013) Key role for a glutathione transferase in multiple-herbicide resistance in grass weeds. Proc Natl Acad Sci USA 110: 5812–5817 Dauer JT, Mortensen DA, Humston R (2006) Controlled experiments to predict horseweed (Conyza canadensis) dispersal distances. Weed Sci 54: 484–489 Deschamps S, Llaca V, May GD (2012) Genotyping-by-sequencing in plants. Biology (Basel) 1: 460–483 DeVlaming V, Proctor VW (1968) Dispersal of aquatic organisms: viability of seeds recovered from the droppings of captive killdeer and mallard ducks. Am J Bot 55: 20–26 Faison WJ, Rostovtsev A, Castro-Nallar E, Crandall KA, Chumakov K, Simonyan V, Mazumder R (2014) Whole genome single-nucleotide variation profile-based phylogenetic tree building methods for analysis of viral, bacterial and human genomes. Genomics 104: 1–7 Feng PCC, Tran M, Chiu T, Sammons RD, Heck GR, Jacob CA (2004) Investigations into glyphosate-resistant horseweed (Conyza canadensis): retention, uptake, translocation, and metabolism. 
Weed Sci 52: 498–505 Gaines TA, Lorentz L, Figge A, Herrmann J, Maiwald F, Ott MC, Han H, Busi R, Yu Q, Powles SB, et al (2014) RNA-Seq transcriptome analysis to identify genes involved in metabolism-based diclofop resistance in Lolium rigidum. Plant J 78: 865–876 Gaines TA, Wright AA, Molin WT, Lorentz L, Riggins CW, Tranel PJ, Beffa R, Westra P, Powles SB (2013) Identification of genetic elements associated with EPSPs gene amplification. PLoS ONE 8: e65819 Gaines TA, Zhang W, Wang D, Bukun B, Chisholm ST, Shaner DL, Nissen SJ, Patzoldt WL, Tranel PJ, Culpepper AS, et al (2010) Gene amplification confers glyphosate resistance in Amaranthus palmeri. Proc Natl Acad Sci USA 107: 1029–1034 Ge X, d’Avignon DA, Ackerman JJH, Collavo A, Sattin M, Ostrander EL, Hall EL, Sammons RD, Preston C (2012) Vacuolar glyphosate-sequestration correlates with glyphosate resistance in ryegrass (Lolium spp.) from Australia, South America, and Europe: a 31P NMR investigation. J Agric Food Chem 60: 1243–1250 Ge X, d’Avignon DA, Ackerman JJH, Sammons D (2014) In vivo 31P-nuclear magnetic resonance studies of glyphosate uptake, vacuolar sequestration, and tonoplast pump activity in glyphosate-resistant horseweed. Plant Physiol 166: 1255–1268 Ge X, d’Avignon DA, Ackerman JJH, Sammons RD (2010) Rapid vacuolar sequestration: the horseweed glyphosate resistance mechanism. Pest Manag Sci 66: 345–348 Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica).
Science 296: 92–100 Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, et al (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31: 5654–5666 Halfhill MD, Good LL, Basu C, Burris J, Main CL, Mueller TC, Stewart CN Jr (2007) Transformation and segregation of GFP fluorescence and glyphosate resistance in horseweed (Conyza canadensis) hybrids. Plant Cell Rep 26: 303–311 Heap I (2014) The International Survey of Herbicide Resistant Weeds. www.weedscience.com (May 2014) Huang X, Feng Q, Qian Q, Zhao Q, Wang L, Wang A, Guan J, Fan D, Weng Q, Huang T, et al (2009) High-throughput genotyping by whole-genome resequencing. Genome Res 19: 1068–1076 Huang X, Wei X, Sang T, Zhao Q, Feng Q, Zhao Y, Li C, Zhu C, Lu T, Zhang Z, et al (2010) Genome-wide association studies of 14 agronomic traits in rice landraces. Nat Genet 42: 961–967 International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436: 793–800 Iwakami S, Uchino A, Kataoka Y, Shibaike H, Watanabe H, Inamura T (2014) Cytochrome P450 genes induced by bispyribac-sodium treatment in a multiple-herbicide-resistant biotype of Echinochloa phyllopogon. Pest Manag Sci 70: 549–558 Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, et al (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449: 463–467 Jugulam M, Niehues K, Godar AS, Koo DH, Danilova T, Friebe B, Sehgal S, Varanasi VK, Wiersma A, Westra P, et al (2014) Tandem amplification of a chromosomal segment harboring EPSPS locus confers glyphosate resistance in Kochia scoparia. Plant Physiol 166: 1200–1207 Kane NC, Gill N, King MG, Bowers JE, Berges H, Gouzy J, Rieseberg LH (2011) Progress towards a reference genome for sunflower. Botany 89: 429–437 Kent WJ (2002) BLAT: the BLAST-like alignment tool.
Genome Res 12: 656–664 Koger CH, Reddy KN (2005) Role of absorption and translocation in the mechanism of glyphosate resistance in horseweed (Conyza canadensis). Weed Sci 53: 84–89 Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, Wang Z, Rasko DA, McCombie WR, Jarvis ED, et al (2012) Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 30: 693–700 Kump KL, Bradbury PJ, Wisser RJ, Buckler ES, Belcher AR, Oropeza-Rosas MA, Zwonitzer JC, Kresovich S, McMullen MD, Ware D, et al (2011) Genome-wide association study of quantitative resistance to southern leaf blight in the maize nested association mapping population. Nat Genet 43: 163–168 Lai Z, Kane NC, Kozik A, Hodgins KA, Dlugosch KM, Barker MS, Matvienko M, Yu Q, Turner KG, Pearl SA, et al (2012) Genomics of Compositae weeds: EST libraries, microarrays, and evidence of introgression. Am J Bot 99: 209–218 Lee RM, Thimmapuram J, Thinglum KA, Gong G, Hernandez AG, Wright CL, Kim RW, Mikel M, Tranel PJ (2009) Sampling the waterhemp (Amaranthus tuberculatus) genome using pyrosequencing technology. Weed Sci 57: 463–469 Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y, et al (2010a) The sequence and de novo assembly of the giant panda genome. Nature 463: 311–317 Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, et al (2010b) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20: 265–272 Main CL, Mueller TC, Hayes RM, Wilkerson JB (2004) Response of selected horseweed (Conyza canadensis (L.) Cronq.) populations to glyphosate. J Agric Food Chem 52: 879–883 Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al (2005) Genome sequencing in microfabricated high-density picolitre reactors.
Nature 437: 376–380 Mueller TC, Massey JH, Hayes RM, Main CL, Stewart CN Jr (2003) Shikimate accumulates in both glyphosate-sensitive and glyphosateresistant horseweed (Conyza canadensis L. Cronq.). J Agric Food Chem 51: 680–684 Okada M, Hanson BD, Hembree KJ, Peng Y, Shrestha A, Stewart CN Jr, Wright SD, Jasieniuk M (2013) Evolution and spread of glyphosate resistance in Conyza canadensis in California. Evol Appl 6: 761–777 Owen MD, Zelaya IA (2005) Herbicide-resistant crops and weed resistance to herbicides. Pest Manag Sci 61: 301–311 Pandit MK, White SM, Pocock MJO (2014) The contrasting effects of genome size, chromosome number and ploidy level on plant invasiveness: a global analysis. New Phytol 203: 697–703 Paterson AH, Freeling M, Tang H, Wang X (2010) Insights from the comparison of plant genome sequences. Annu Rev Plant Biol 61: 349–372 Peng Y, Abercrombie LLG, Yuan JS, Riggins CW, Sammons RD, Tranel PJ, Stewart CN Jr (2010) Characterization of the horseweed (Conyza canadensis) transcriptome using GS-FLX 454 pyrosequencing and its application for expression analysis of candidate non-target herbicide resistance genes. Pest Manag Sci 66: 1053–1062 Pimentel D, Lach L, Zuniga R, Morrison D (2000) Environmental and economic costs of nonindigenous species in the United States. Bioscience 50: 53–65 Preston C, Wakelin AM (2008) Resistance to glyphosate from altered herbicide translocation patterns. Pest Manag Sci 64: 372–376 Riar DS, Norsworthy JK, Johnson DB, Scott RC, Bagavathiannan M (2011) Glyphosate resistance in a johnsongrass (Sorghum halepense) biotype from Arkansas. Weed Sci 59: 299–304 Riggins CW, Peng Y, Stewart CN Jr, Tranel PJ (2010) Characterization of waterhemp transcriptome using 454 pyrosequencing and its application for studies of herbicide target-site genes. 
Pest Manag Sci 66: 1042–1052 Rojano-Delgado AM, Cruz-Hipolito H, De Prado R, Luque de Castro MD, Franco AR (2012) Limited uptake, translocation and enhanced metabolic degradation contribute to glyphosate tolerance in Mucuna pruriens var. utilis plants. Phytochemistry 73: 34–41 Sammons RD, Gaines TA (2014) Glyphosate resistance: state of knowledge. Pest Manag Sci 70: 1367–1377 Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, et al (2010) Genome sequence of the palaeopolyploid soybean. Nature 463: 178–183 Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, et al (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326: 1112–1115 Seabury CM, Dowd SE, Seabury PM, Raudsepp T, Brightsmith DJ, Liboriussen P, Halley Y, Fisher CA, Owens E, Viswanathan G, et al (2013) A multi-platform draft de novo genome assembly and comparative analysis for the scarlet macaw (Ara macao). PLoS ONE 8: e62415 Shaner DL (2009) the role of translocation as a mechanism of resistance to glyphosate. Weed Sci 57: 118–123 Shields EJ, Dauer JT, VanGessel MJ, Neumann G (2006) Horseweed (Conyza canadensis) seed collected in the planetary boundary layer. Weed Sci 54: 1063–1067 Sterck L, Rombauts S, Vandepoele K, Rouzé P, Van de Peer Y (2007) How many genes are there in plants (...and why are they there)? Curr Opin Plant Biol 10: 199–203 Stewart CN Jr, editor (2009) Weedy and Invasive Plant Genomics. WileyBlackwell, Ames, IA Stewart CN Jr, Tranel PJ, Horvath DP, Anderson JV, Rieseberg LH, Westwood JH, Mallory-Smith CA, Zapiola ML, Dlugosch KM (2009) Evolution of weediness and invasiveness: charting the course for weed genomics. Weed Sci 57: 451–462 Tian X, Zheng J, Hu S, Yu J (2006) The rice mitochondrial genomes and their variations. 
Plant Physiol 140: 401–410 Todesco M, Balasubramanian S, Hu TT, Traw MB, Horton M, Epple P, Kuhns C, Sureshkumar S, Schwartz C, Lanz C, et al (2010) Natural allelic variation underlying a major fitness trade-off in Arabidopsis thaliana. Nature 465: 632–636 Truco MJ, Ashrafi H, Kozik A, van Leeuwen H, Bowers J, Wo SRC, Michelmore RW (2013) An ultra-high-density, transcript-based, genetic map of lettuce. G3 (Bethesda) 3: 617–631 Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A, et al (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313: 1596–1604 Plant Physiol. Vol. 166, 2014 1253 Downloaded from on June 17, 2017 - Published by www.plantphysiol.org Copyright © 2014 American Society of Plant Biologists. All rights reserved. Peng et al. VanGessel MJ (2001) Glyphosate-resistant horseweed from Delaware. Weed Sci 49: 703–705 Varshney RK, Chen W, Li Y, Bharti AK, Saxena RK, Schlueter JA, Donoghue MT, Azam S, Fan G, Whaley AM, et al (2012) Draft genome sequence of pigeonpea (Cajanus cajan), an orphan legume crop of resource-poor farmers. Nat Biotechnol 30: 83–89 Varshney RK, Song C, Saxena RK, Azam S, Yu S, Sharpe AG, Cannon S, Baek J, Rosen BD, Tar’an B, et al (2013) Draft genome sequence of chickpea (Cicer arietinum) provides a resource for trait improvement. Nat Biotechnol 31: 240–246 Vila-Aiub MM, Balbi MC, Distéfano AJ, Fernández L, Hopp E, Yu Q, Powles SB (2012) Glyphosate resistance in perennial Sorghum halepense (johnsongrass), endowed by reduced glyphosate translocation and leaf uptake. Pest Manag Sci 68: 430–436 Wang Z, Hobson N, Galindo L, Zhu S, Shi D, McDill J, Yang L, Hawkins S, Neutelings G, Datla R, et al (2012) The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads. Plant J 72: 461–473 Warwick SI, Stewart CN Jr (2005) Crops come from wild plants: how domestication, transgenes, and linkage effects shape ferality. 
In J Gressel, ed, Crop Ferality and Volunteerism. CRC Press, Boca Raton, FL, pp 9–30 Weaver SE (2001) The biology of Canadian weeds. 115. Conyza canadensis. Can J Plant Sci 81: 867–875 Weaver SE, McWilliams EL (1980) The biology of Canadian weeds. 44. Amaranthus retroflexus L., A. powellii S. Wats. and A. hybridus L. Can J Plant Sci 60: 1215–1234 Wu X, Ren C, Joshi T, Vuong T, Xu D, Nguyen HT (2010) SNP discovery by high-throughput sequencing in soybean. BMC Genomics 11: 469 Wyman SK, Jansen RK, Boore JL (2004) Automatic annotation of organellar genomes with DOGMA. Bioinformatics 20: 3252–3255 Yu J, Hu S, Wang J, Wong GKS, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79–92 Yuan JS, Abercrombie LG, Cao Y, Halfhill MD, Zhou X, Peng Y, Hu J, Rao MR, Heck GR, Larosa TJ, et al (2010) Functional genomics analysis of glyphosate resistance in Conyza canadensis (horseweed). Weed Sci 58: 109–117 Yuan JS, Tranel PJ, Stewart CN Jr (2007) Non-target-site herbicide resistance: a family business. Trends Plant Sci 12: 6–13 Zelaya IA, Owen MDK, Vangessel MJ (2004) Inheritance of evolved glyphosate resistance in Conyza canadensis (L.) Cronq. Theor Appl Genet 110: 58–70 Zhou X, Su Z, Sammons RD, Peng Y, Tranel PJ, Stewart CN Jr, Yuan JS (2009) Novel software package for cross-platform transcriptome analysis (CPTRA). BMC Bioinformatics (Suppl 11) 10: S16 1254 Plant Physiol. Vol. 166, 2014 Downloaded from on June 17, 2017 - Published by www.plantphysiol.org Copyright © 2014 American Society of Plant Biologists. All rights reserved. CORRECTIONS Vol. 166: 1241–1254, 2014 Peng Y., Lai Z., Lane T., Nageswara-Rao M., Okada M., Jasieniuk M., O’Geen H., Kim R.W., Sammons R.D., Rieseberg L.H., and Stewart C.N. Jr. De Novo Genome Assembly of the Economically Important Weed Horseweed Using Integrated Data from Multiple Sequencing Platforms. On p. 
1243 of this article, the assembled draft of horseweed genome data submitted to the National Center for Biotechnology Information is incorrectly listed as having the accession number SUB535309. The correct accession number for the Whole Genome Shotgun project deposited at DDBJ/ EMBL/GenBank is JSWR00000000. www.plantphysiol.org/cgi/doi/10.1104/pp.114.900500 2218 Plant PhysiologyÒ, December 2014, Vol. 166, p. 2218, www.plantphysiol.org Ó 2014 American Society of Plant Biologists. All Rights Reserved.