Plant Cell Advance Publication. Published on March 28, 2016, doi:10.1105/tpc.15.00933

LARGE-SCALE BIOLOGY ARTICLE

xGDBvm: A Web GUI-driven workflow for annotating eukaryotic genomes in the cloud

Jon Duvick (1), Daniel S. Standage (2), Nirav Merchant (3), Volker P. Brendel (4,5)

Author Affiliations
1. Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa 50011 USA
2. Department of Biology, Indiana University, Bloomington, Indiana 47405 USA
3. Bio Computing Facility, University of Arizona, Tucson, Arizona 85721 USA
4. Department of Biology and School of Informatics & Computing, Indiana University, Bloomington, Indiana 47405 USA
5. Corresponding author. E-mail: [email protected]

The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantcell.org) is: Volker P. Brendel.

Synopsis: xGDBvm is a novel tool for scalable, reproducible, and expandable genome annotation via Web-based interfaces that seamlessly integrate background cloud-based data storage and high-performance computer resources.

©2016 American Society of Plant Biologists. All Rights Reserved.

ABSTRACT
Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today’s pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface (GUI) to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image.
Once their virtual machine instance is deployed through iPlant’s Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be re-annotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load pre-computed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, re-annotation, comparison of different annotations, and training or teaching.

INTRODUCTION
The number of sequenced eukaryotic genomes is increasing rapidly due to advances in sequencing technology and cost-effectiveness; for recent lists see https://gold.jgi.doe.gov (Reddy et al., 2015) and http://www.diark.org/diark (Hammesfahr et al., 2011). However, the pace of data acquisition leads to bottlenecks at both assembly and annotation stages, before the sequence data can be consumed for research. In particular, annotating a novel genome is often challenging due to our incomplete knowledge of what constitutes a gene across a wide range of species, meaning that ab initio gene prediction, although useful, is inadequate (Yandell and Ence, 2012).
Full genome annotation typically consists of at minimum: 1) optionally repeat masking the genome; 2) splice-aligning transcripts and proteins from related species for evidence-based gene structure prediction; 3) using ab initio gene finding algorithms to annotate possible gene structures; 4) combining the above data sources to create a set of possible gene structures; and 5) filtering the results through quality and/or similarity filters to find the most probable set of structures that represent full-length or near-full-length coding genes. As a result, genome annotation is necessarily a time-consuming and computationally intensive process that combines numerous types of sequence analysis and heuristic prediction, typically relying on well-annotated genomes as a reference and typically resulting in a far-from-perfect (but arguably useful) draft annotation. A number of groups have published complete computational pipelines for eukaryotic genome annotation (Mungall et al., 2002; Potter et al., 2004; Uberbacher et al., 2004; Cantarel et al., 2008; Foissac et al., 2008; Holt and Yandell, 2011; Specht et al., 2011; Grigoriev et al., 2012; Leroy et al., 2012; Thibaud-Nissen et al., 2013; Hoff et al., 2016). However, these pipelines require considerable expertise to install, configure, troubleshoot, and manage. We propose that a ‘turnkey’ genome annotation system could greatly benefit researchers who desire a credible draft genome annotation to facilitate further research, as well as foster comparative genomics, as early as possible in the life of their project. Among the desirable attributes of such a system would be the following:

Easy to configure. An annotation workflow will necessarily combine a wide range of computational tools whose successful configuration and interoperability would be challenging for the non-specialist, so ideally it should be available as a precompiled package.
A common method for packaging and distributing such a complex system is via a virtual machine or VM, which encapsulates the underlying server operating system, the application software components along with all requisite software dependencies, and configuration settings, all of which are stored (“imaged”) in such a way that they can be copied and launched by means of commonly available virtualization tools and made available to anyone with access to virtual server software such as KVM (http://www.linux-kvm.org) or VirtualBox (https://www.virtualbox.org). VMs have a number of advantages for complex informatics analysis (Nocq et al., 2013), of which the pre-installation of all required software for complex tasks as well as temporary access to all the computer resources needed for completion of the task are of most practical value for a typical biologist user. Cloud computing platforms such as OpenStack (https://www.openstack.org) and Docker (https://www.docker.com) offer VM and container-based technologies that can be managed, accessed remotely, and readily deployed on commercial cloud-based services such as Amazon Web Services (https://aws.amazon.com). Government-funded consortia such as the iPlant Collaborative (now CyVerse) (Goff et al., 2011) make such virtual platforms readily accessible to individual users via the internet.

Easy to use. Although most genome researchers are familiar with a wide range of online tools to evaluate sequence data, they will not necessarily know how to put them together and configure them appropriately. Ideally, an annotation platform should have a cohesive GUI that guides the user through setup, configuration, parameter setting and status reporting. Importantly, all setup and processing steps should be managed with data sanity checks (for completeness and format), context-dependent menus, error logging and reporting, and help documentation/tutorials.

Editable.
The ability to edit and improve automated annotation should be built in. This means the ability both to add additional data once the workflow has completed and to modify individual annotations in such a way that the most critical regions of the genome are well annotated.

Reproducible. With variable parameters and source datasets, automated documentation and simple archiving are essential for ensuring repeatability of the genome annotation process.

Scalable. With large genomes and large transcript datasets, computations such as spliced alignment can take days or weeks on a typical lab computer, whereas with access to HPC resources the process can be completed in a few hours. Many research facilities have such resources, but they are complex to use and not necessarily available to every interested researcher.

Publishable. Once computation is complete, the annotated genome and its input/output files should be available online either to a select community (with password access) or to the research community as a whole, thus placing output data and/or community annotation tools in the hands of the target audience in a timely manner.

With the above attributes in mind, we created a self-contained genome annotation platform, xGDBvm, for use by the research community. We report below our initial release of xGDBvm in the iPlant (CyVerse) Atmosphere cloud infrastructure (http://www.iplantcollaborative.org/ci/atmosphere) as an on-demand virtual server for genome annotation that can be adapted for a wide range of research needs.

RESULTS

Overview of xGDBvm
xGDBvm is a Linux-based platform that accepts genomic and transcript and/or protein sequence inputs and creates a genome annotation that can be displayed in the included, full-featured genome browser, with separate tracks for genome segments, transcript and protein alignments, gene predictions, and repeat-masked regions (Fig. 1). xGDBvm uses a modified and extended version of the xGDB (Extensible Genome Data Broker) Web platform (Schlueter et al., 2006) written in Perl and PHP, along with a Web server, workflow automation scripts, and executables packaged together as a virtual server and configured for access over HTTP or HTTPS via a graphical user interface (GUI). xGDBvm is compact in size, occupying approximately 13 gigabytes (GB) of a typical 20 GB VM root partition. Data inputs/outputs are preferably stored on external volumes mounted to the VM, thus alleviating constraints on VM size.

Computational processes in xGDBvm (Fig. 2) are managed by automated, user-configurable workflows, with a built-in option for calls to HPC resources. Optional masking of genome segments is carried out using Vmatch (Abouelhoda et al., 2002) based on user-provided masking libraries. Spliced alignments of transcripts and proteins to the genome are computed using GeneSeqer (Usuka et al., 2000) and GenomeThreader (Gremme et al., 2005), respectively. xGDBvm optionally creates gene model predictions using CpGAT (Comprehensive Gene Annotation Tool; http://plantgdb.org/AtGDB/cgi-bin//WebCpGAT.pl), a set of scripts and binaries that integrates spliced-alignment data and ab initio gene predictions along with BLAST similarity filters and alternative structures to derive a high-quality gene prediction dataset. The xGDBvm workflow can also upload pre-computed gene predictions from a user-provided GFF3-formatted file.
All steps are logged and displayed dynamically during workflow operation. Once complete, each feature is displayed as a separate track in a fully featured genome browser complete with search/download tools and tabular feature views. A quality score assigned to each annotated locus facilitates the identification of low-quality models, which can then be re-annotated and curated using the built-in yrGATE annotation tool (Wilkerson et al., 2006). Additional genomes can be configured and created with the same VM, and the user can archive and retrieve single or global datasets. Any data type can be appended or replaced using an ‘Update’ feature. The outcome is a rich, editable environment for genome exploration and annotation, accessible locally or remotely on the Web (see Table 1 for feature overview).

Figure 1. Overview of xGDBvm as implemented at CyVerse (iPlant). xGDBvm is a virtual server environment for gene structure annotation that can be cloned, configured, populated with input data, and run from a Web browser in a few steps, summarized here: A. Log in to the CyVerse Atmosphere Control Panel (https://atmo.iplantcollaborative.org/application) (1) and click to create a new instance (cloned copy) of xGDBvm (2), create a block storage volume for output data, and attach it to the instance (3). Open a Web shell interface (4), accessible from the Control Panel, and type a series of commands to set up and configure the new xGDBvm instance, also mounting the Data Store and the attached volume. B. Log in to the CyVerse Data Store cloud storage system (https://de.iplantcollaborative.org/de/) and upload input data files to an input data directory (accessible to the VM) using a batch uploading tool. Naming conventions are used to identify each input type. C. Log in to the xGDBvm instance’s Graphical User Interface (GUI) using HTTPS via the VM’s unique IP address or using a Virtual Network Computing (VNC) client. All subsequent steps are carried out using the xGDBvm GUI. Authorize the VM to connect to remote HPC resources via the Agave API (http://agaveapi.co) (2). Configure the path to Data Store inputs and set other parameters including remote job execution (optional). xGDBvm will validate files, return expected outputs and flag any input file errors (3). Initiate automated workflows and monitor progress (4). The workflow sends some data remotely for processing on High Performance Computing (HPC) resources (https://www.xsede.org/) managed by Agave APIs, and processes other files locally using the attached volume as a scratch disk. The xGDBvm workflow waits for HPC outputs, then proceeds with the annotation process. Output data are written to the external volume and can be accessed from the xGDBvm Web browser as GDB001, GDB002, etc. (5). In addition to a fully featured genome browser, xGDBvm includes tools to query, update, reannotate, download, or archive outputs to the user’s Data Store. For details, refer to the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php).

Figure 2. Data process schema. Input data types (with standardized names as indicated), computational modules, and outputs are shown. Images are screenshots of color-coded track glyph types (gene models; splice alignments) and track flags (quality scores) displayed in the xGDBvm genome browser. Not shown: xGDBvm can also display unknown sequence or repeat-masked regions as a grey bar. See text for details.

xGDBvm-iPlant
We implemented xGDBvm as a VM image on iPlant’s Atmosphere cloud platform (https://atmo.iplantcollaborative.org/application), available to registered life sciences researchers (see http://www.iplantcollaborative.org/content/acceptable-use-policy). We further customized the VM taking advantage of iPlant’s data and job execution APIs, making xGDBvm a one-stop destination for genome annotation and display.
Registered iPlant users can create and configure an xGDBvm instance via the Atmosphere control panel, and then access the xGDBvm instance via a Web browser to perform all subsequent tasks: validate inputs, run HPC jobs, initiate local workflows, check progress, and view/edit the resulting genome annotation. The genome browser(s) can be made public or private as desired. The following sections detail xGDBvm’s functionality in its current version on iPlant Atmosphere.

Inputs and data processing
Fig. 3 diagrams the modular architecture used by xGDBvm at iPlant. For managing inputs, xGDBvm uses iPlant’s Data Store cloud storage service (http://www.iplantcollaborative.org/ci/data-store), which provides high capacity storage and tools for quickly uploading user data files. During the xGDBvm configuration process, the user’s Data Store home directory is mounted to the VM’s file system using iRODS FUSE (http://irods.org), and files uploaded to the Data Store are thus accessible on the VM using Unix file system commands. For output data (alignment files, GFF3 files, sequence indexes, MySQL database tables, configuration files, and archives), the user can attach a block storage volume to the VM via the Atmosphere control panel, and mount it to the VM’s file system. This data partitioning strategy has the advantage that all data outputs are separate from the VM and do not consume its limited storage capacity, while at the same time providing scalability, as the data transfer for HPC jobs occurs directly with the Data Store. Moreover, the complete xGDBvm display can be reconstituted by mounting the volume to a new xGDBvm instance, which is useful in the event a VM becomes unavailable.

Figure 3. xGDBvm architecture. An xGDBvm instance, hosted on CyVerse’s Atmosphere cloud infrastructure (https://atmo.iplantcollaborative.org/application), has separate file system partitions under root (containing the xGDBvm Web GUI, scripts and binaries, and other software) and /home/ (which is configured with mount points for the user’s Data Store home directory for data input and a block storage volume for data output). The Agave API, hosted by the CyVerse Discovery Environment, is used for authentication of the VM via OAuth2 and for management of High Performance Computing applications and job submission. A key feature of xGDBvm is the ability to attach and mount the output volume to a different VM and reconstitute the annotation outputs and display. See text for details.

Managing files and ensuring validity of inputs (sanity checks) is a challenge for computational pipelines where multiple inputs of various types and formats may be used. xGDBvm makes use of filename standardization and extensive validation tools to reduce the incidence of input errors. Each input file is required to be named according to its data type and file format, e.g. ~est.fa for a FASTA file of EST sequences, where “~” is any user prefix, and all input files are placed in a single directory whose path is saved as a configuration variable. In addition, output files (including copies of input files) are all named according to the same conventions, with the GDB number as a prefix, e.g. GDB001est.fa, and deposited in subdirectories according to their type/process. Once an input path has been specified, xGDBvm displays valid filenames in the input directory according to type, displays predicted output tracks, and alerts the user to any missing files that would compromise output. The user then initiates a script to validate sequence deflines (description lines), error-check IDs and enumerate file contents either singly or in batch mode (see Supplemental Fig. 1). File validity metadata are stored along with a unique file stamp, so files need only be validated once unless modified.

Supplemental Fig.
2 shows the complete, automated workflow for creating and updating a genome annotation. Typical inputs include a genome sequence assembly and a set of transcript sequences (EST, cDNA, or short read/transcript assembly [TSA]) and/or predicted protein sequences, in FASTA format. Depending on availability, transcripts may be from the same or a closely related species (Wang et al., 2008). Protein sequences should be from a well-characterized genome as close as possible taxonomically to the target species. With transcript (EST, cDNA or TSA) inputs, xGDBvm will compute spliced alignments, according to user-specified or default parameters, using the multithreaded GeneSeqer-MPI spliced alignment program (Usuka et al., 2000) installed locally or on an HPC server with up to 128 cores. For this step, the user can opt to apply repeat masking to the genome sequence using vmktree/vmatch (Abouelhoda et al., 2002) to reduce computation time, with inclusion of a suitable repeat mask sequence library. Alternatively, the user can provide an N-masked genome file as input. For related-species protein inputs, xGDBvm computes spliced alignments using the GenomeThreader program (Gremme et al., 2005) either locally or on an HPC server. Spliced alignments that meet a quality threshold are ultimately displayed in the xGDBvm genome browser as discrete tracks with standard box-line glyphs to indicate exon/intron boundaries (Fig. 2). The user can also provide GeneSeqer and/or GenomeThreader output files, created offline, as inputs, bypassing the above steps.

The xGDBvm workflow next uses spliced alignment data as input for CpGAT, which assembles gene model predictions for the genome.
CpGAT uses EVM (EVidence Modeler; http://evidencemodeler.github.io) (Haas et al., 2008) to evaluate GeneSeqer transcript alignments and/or GenomeThreader protein spliced alignments, together with ab initio gene finder results from BGF (http://bgf.genomics.org.cn), GeneMark (http://exon.gatech.edu/GeneMark/) (Borodovsky et al., 2003), and Augustus (http://bioinf.uni-greifswald.de/augustus/) (Stanke et al., 2006), and derives an optimal set of transcript models which are then BLASTed against a reference protein dataset (if supplied by the user). In addition, some PASA (Haas et al., 2003) functions are used to aggregate splice variant models where indicated by evidence alignments. Optionally the user can request repeat masking of the genome prior to ab initio gene prediction. The output from CpGAT is a set of BLAST-filtered or unfiltered gene model structures for each genome segment, complete with coordinates for start/stop codon and predicted UTRs where possible, in GFF3 format, which are loaded to the xGDBvm database. Several CpGAT parameters are user-configurable with the xGDBvm GUI, allowing the user to select species model or bypass ab initio gene finders, relax reference protein BLAST filtering, or request repeat masking, and the complete set of CpGAT parameters can be modified by editing the CpGAT configuration file.

As a final step, xGDBvm calculates the GAEVAL score for each gene model, consisting of a set of statistics representing the degree of congruence of the model with available alignment evidence (http://plantgdb.org/GAEVAL/docs/index.html). GAEVAL also reports alternative splicing evidence and classifies annotation errors into discrete types such as gene fusion, gene fission, etc. GAEVAL data summaries are displayed in xGDBvm as a flag associated with each track glyph (Schlueter et al., 2005).
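
To make the idea of congruence between a gene model and its alignment evidence more concrete, the following minimal Python sketch computes one simple statistic: the fraction of a model's introns that are matched exactly by at least one spliced alignment. This is an illustration only and is not GAEVAL's actual scoring. The function names, the coordinate representation (1-based inclusive, non-overlapping exon intervals on one strand), and the value returned for single-exon models are all assumptions made for the example.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive, start <= end


def introns(exons: List[Interval]) -> List[Interval]:
    """Return the introns implied by a list of non-overlapping exons."""
    ordered = sorted(exons)
    return [(left[1] + 1, right[0] - 1) for left, right in zip(ordered, ordered[1:])]


def intron_confirmation(model: List[Interval], alignments: List[List[Interval]]) -> float:
    """Fraction of model introns matched exactly by any evidence alignment.

    Hypothetical convention: a single-exon model has no introns to refute,
    so it is reported as 1.0.
    """
    model_introns = introns(model)
    if not model_introns:
        return 1.0
    evidence: Set[Interval] = set()
    for aln in alignments:
        evidence.update(introns(aln))
    confirmed = sum(1 for intron in model_introns if intron in evidence)
    return confirmed / len(model_introns)


# Made-up example: a three-exon model and two transcript alignments.
model = [(100, 200), (300, 400), (500, 600)]
transcripts = [
    [(150, 200), (300, 350)],   # supports the first intron (201-299)
    [(310, 400), (520, 600)],   # implies intron 401-519, which disagrees with the model
]
score = intron_confirmation(model, transcripts)
```

In this made-up case only one of the two model introns is confirmed, so a low score would flag the locus as a candidate for manual review.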
Users can also upload pre-computed genome annotations provided as GFF3 file(s) along with optional transcript and translation FASTA files. These data are displayed in the form of a separate annotation track, with GAEVAL scores calculated as described above. If gene descriptions are available in tabular form, these can also be uploaded to augment gene annotation tracks.

xGDBvm setup, configuration and data processing
xGDBvm was designed to be easy to configure and run (Fig. 1). As a supplement to online help and video tutorials (see below), beginning users can consult the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php), which includes step-by-step instructions and information about how to choose the correct VM size and storage capacity for their particular genome annotation needs.

After instance creation, the user accesses the shell via a terminal emulator or Atmosphere’s built-in shell emulator and types a series of simple commands to configure and password-protect the VM environment. Subsequent steps are accomplished using a Web browser connecting to the VM via HTTPS, or by connecting to the VM using a virtual network computing (VNC) client (Atmosphere offers a built-in VNC window as well). xGDBvm’s hierarchical user interface is organized by task type: Manage, View, Annotate, and Help, with submenus under each section. Under Manage are Admin (manage site passwords, admin emails, and yrGATE users); Configure/Create (create or update a genome browser); and Remote Jobs (configure and manage remote HPC jobs; see next section). End-user oriented sections include View (browse/analyze genomes), and Annotate (submit/manage user annotations).
Each section and subsection includes a Getting Started page that outlines the suggested workflow along with key links and one or more Help pages with detailed documentation, including video tutorials that can be viewed on the VM. Contextual popup help dialogs are also provided for each page/step.

Under Manage → Configure/Create, a user can check volume capacity of the VM, manage license keys for certain installed software and then consult a decision tree to guide them to the correct data sources, a table of filename conventions, and a guide to CpGAT annotation. Once the data files are in place, the user clicks ‘Create New GDB’, selects a file path pointing to the data input files, enters any non-default parameters as well as genome metadata, and then saves the configuration setup, which is assigned ‘Development’ status and an ID (GDB001, etc.) that will be associated with the output database (Fig. 4A). The user can now click to validate file contents as described above. To initiate data processing, the user selects ‘Data Process Options’ followed by ‘Create GDB’, which changes status to ‘Locked’, initiates the central data processing workflow, and displays a running report of progress together with any errors. The workflow can be aborted at any time by clicking the ‘Abort’ button under ‘Data Process Options’; this removes all dynamically created directories and kills all associated processes, returning the configuration to ‘Development’ status. On successful workflow completion, GDB status is changed to ‘Current’ and the new genome is added to the View menu structure. Input datasets, annotation statistics, and output datasets can be viewed online. Output errors are logged and displayed to the user along with context-specific help dialogs (Supplemental Fig. 3).

Figure 4. xGDBvm data management. A. Screenshot of the GDB Configuration GUI, set up for processing Example data. Each genome annotation is assigned a unique identifier (GDB001, GDB002, etc.) and a user-provided name. In addition to form fields for input data path, annotation parameters, and metadata, this page provides extensive color-coded information about all system settings (e.g. license keys, storage capacity, login status, displayed in blue-green), input data validity (light green), and expected output (orange). The GUI includes buttons that launch modal windows to initiate computational workflow or edit configuration. B. Screenshot of Archive/Delete GUI, showing genome databases with ‘Current’ (blue; computation complete) or ‘Development’ (grey; not yet run) status. Each table row displays information about a GDB including time stamps as well as action buttons that allow the user to Drop, Delete, Archive, Delete Archive, or Copy database (see text for details). Global action buttons (top right) allow the user to Delete or Archive all data on the VM. C. Screenshot of List All Jobs GUI with tools to monitor and manage remote HPC jobs. The GUI displays IDs, job metadata, time stamps, color-coded status indicators and action buttons to manage output (Stop Job, Delete Job, View Logs, Copy Output) via the Agave API; see text for details.

Any of several lightweight, preconfigured sample datasets (Supplemental Fig. 4) can be loaded with a single button click from the ‘Create New’ page and then saved and processed to a finished GDB in no more than a few minutes. Because these examples cover the complete range of processes and workflows in the xGDBvm code, they also serve as functional tests when first setting up an xGDBvm instance or modifying its code.

High-performance computing option
On multi-processor VMs, xGDBvm automatically invokes parallel processing where possible for certain computational steps (see Supplemental Figure 1).
This can speed up spliced alignment and genome annotation (CpGAT) jobs, in that more than one genome segment can be evaluated concurrently on separate processor threads. As an alternative for even more processing power, xGDBvm is capable of sending input data for spliced alignment jobs to high-performance computing facilities, either as a standalone job or as part of an annotation workflow. For this option, the user’s input data must be on a VM-mounted iPlant Data Store directory and assigned to a GDB with ‘Development’ status. GeneSeqer-MPI and GenomeThreader binaries, along with wrapper scripts for job submission to an HPC server, are installed in iPlant’s Discovery Environment (https://de.iplantcollaborative.org/de/) as executable ‘apps’. Client access to HPC resources and apps is managed via the Agave API (Dooley et al., 2012), http://agaveapi.co, which provides an open-source platform for interacting with computational resources that are managed under the XSEDE system (https://www.xsede.org/). xGDBvm uses Agave’s implementation of the OAuth2 (http://oauth.net) standard for authorization and subsequent authentication to use apps. Under Manage → Remote Jobs, users first submit their iPlant username/password in return for OAuth2 credentials that are stored securely on the VM and allow access to remote applications (GeneSeqer-MPI and GenomeThreader). The user can then log in and obtain a temporary access token and refresh token for authentication. The VM-cached refresh token is also used by local scripts to re-authenticate API access during automated workflow processing. The user can select the app size (i.e. number of processors) for optimal efficiency given their genome size and complexity and then return to the GDB Configuration page, select the ‘remote’ option for spliced alignment, and initiate the automated workflow.
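
The access-token and refresh-token handling described above follows a common pattern that can be sketched in a few lines of Python. This is a generic illustration of cached-token renewal, not xGDBvm's or Agave's code: the `TokenCache` class, the injected `refresh` callable, and the 60-second safety margin are all hypothetical, and no real endpoint is contacted.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# A refresh function takes the current refresh token and returns
# (new_access_token, new_refresh_token, lifetime_in_seconds).
RefreshFn = Callable[[str], Tuple[str, str, float]]


@dataclass
class TokenCache:
    access_token: str
    refresh_token: str
    expires_at: float        # absolute expiry time, seconds since the epoch
    refresh: RefreshFn       # hypothetical stand-in for a call to the auth service
    margin: float = 60.0     # renew this many seconds before expiry

    def get(self, now: Optional[float] = None) -> str:
        """Return a usable access token, renewing it first if it is about to expire."""
        current = time.time() if now is None else now
        if current >= self.expires_at - self.margin:
            self.access_token, self.refresh_token, lifetime = self.refresh(self.refresh_token)
            self.expires_at = current + lifetime
        return self.access_token
```

A long-running workflow script would call `get()` immediately before each API request, so that a token expiring midway through a multi-hour job is renewed without asking the user to log in again.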
The xGDBvm workflow script copies relevant input data (genome, transcript and/or protein) to a temporary directory on the user’s mounted Data Store directory and issues a job submission command via cURL (https://curl.haxx.se) to a custom wrapper script (see Fig. 3). The wrapper script accepts parameters, splits and indexes input files as appropriate for multiple processors, and then issues a command to launch GeneSeqer-MPI or GenomeThreader on the specified HPC server cluster. The xGDBvm workflow updates remote job status periodically using a callback URL to xGDBvm and/or an email notification service. Output data are copied to a specified subdirectory on the user’s Data Store directory, where xGDBvm’s workflow can access them for further processing. Remote job details and status are tracked by xGDBvm, and users can access job lists, query remote job status, and kill a remote job using the Manage → Remote Jobs GUI (Fig. 4C).

Remote GeneSeqer or GenomeThreader spliced alignment jobs can also be run as a standalone process via Manage → Remote Jobs. Output is archived on the user’s Data Store directory, and xGDBvm can be directed to evaluate the output and copy output files to an input directory for inclusion in workflow processing.

Logging / troubleshooting
Each step in xGDBvm’s computational workflow script (see Supplemental Fig. 2) is displayed dynamically during automated workflow operation and saved in a process log. Common errors (e.g. mismatch in data input/output, incorrect format, duplicate IDs) are flagged and logged in an error file, along with user hints to remedy the problem (see Supplemental Fig. 3). A separate file is created for logging CpGAT progress.

Outputs and data analysis tools
xGDBvm displays the output of workflow processing as schematized glyphs, organized into color-coded tracks, in a full-featured genome browser (Fig. 5).
Standard tracks include EST, cDNA, TSA (transcript sequence assembly), and protein spliced alignments; pre-computed and CpGAT gene predictions; and regions that have been repeat masked or assigned as spacer regions (N-substituted). Additional user-generated tracks include yrGATE annotations and region-specific CpGAT annotations. Advanced users can create unlimited additional tracks by manually populating new data tables and modifying configuration files. The xGDBvm genome browser has track features similar to those currently available at http://plantgdb.org (zoom/scroll; show/hide or reorder tracks; change font size; view base pair level). The genome browser also includes a suite of analysis tools including search and retrieve for sequence or subsequence regions (introns, exons, up/downstream regions); NCBI-BLAST for sequence queries within or across genomes; region-specific GenomeThreader and CpGAT tools; and the ability to add a custom track from a local GFF file. Complementing the Genome Context View are searchable, tabular views for each Feature Track type ordered by genome position. The Gene Models table displays annotated loci along with structural metadata, similarity descriptions, GAEVAL gene quality/coverage, and yrGATE annotation status (see below). The Aligned Proteins and Aligned Transcripts tables display splice-aligned sequences of each type with filters for alignment quality/coverage and links to alignment details. A separate page for GAEVAL Scores displays comprehensive gene quality data based on comparison of gene predictions with alignment evidence and offers multiple search filters.

All inputs, outputs, and archives (see below) are stored hierarchically under /xGDBvm/data/GDBnnn/data/, and they are also available for download to local storage using the VM’s GUI (View → GDBnnn → Data Download).
Using this download service, the user could, for example, retrieve GFF-formatted annotation outputs from CpGAT for use in further analysis or display on a different genome browser. Data files can also be copied to the Data Store either manually or by creating and copying a GDB Archive (see below).

Updating or adding tracks

In cases where the user may wish to append or replace data, xGDBvm includes an Update branch to the data workflow allowing any track to be appended or replaced. The user sets an ‘Update’ flag on the configuration page, specifies a directory where update data reside, and selects the data type(s) and update action(s) desired. The user then clicks ‘Update’, which adds or replaces data inputs and re-runs appropriate scripts to update the genome data tables, indices and display. All update actions are logged in the same way as for a new GDB and are appended to the same process log.

The xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/) includes complete instructions for adding annotation or alignment tracks beyond the five standard tracks available. Users familiar with MySQL and the necessary computational steps can completely customize an instance of xGDBvm, using pre-computed data as inputs.
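The difference between the two update actions described in this section (append versus replace, each recorded in the process log) can be illustrated with a small sketch. It is a simplified model under stated assumptions, not xGDBvm's workflow code: the `apply_update` function, the in-memory `tracks` dictionary, and the log message format are hypothetical, whereas the real workflow operates on data files and MySQL tables.

```python
from typing import Dict, List


def apply_update(tracks: Dict[str, List[str]], track: str,
                 new_records: List[str], action: str,
                 log: List[str]) -> None:
    """Append to or replace one track in place, then log the action."""
    if action == "append":
        tracks.setdefault(track, []).extend(new_records)
    elif action == "replace":
        tracks[track] = list(new_records)
    else:
        raise ValueError(f"unknown update action: {action!r}")
    log.append(f"{action} {track}: {len(new_records)} record(s)")


# Toy data: record names are placeholders, not real alignments.
tracks = {"est": ["est_1", "est_2"], "protein": ["prot_1"]}
process_log: List[str] = []
apply_update(tracks, "est", ["est_3"], "append", process_log)
apply_update(tracks, "protein", ["prot_9"], "replace", process_log)
```

After these two calls the EST track holds three records, the protein track holds only the replacement record, and the log contains one entry per action.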
Managing xGDBvm datasets

Output datasets can be managed on the Manage → Config/Create → Archive/Delete page (Fig. 4B). For archiving a GDB, the entire output directory tree is compressed as a tar archive and stored in an Archive directory under /xGDBvm/data/ArchiveGDB/, and the archive can be copied to the user’s Data Store with a single button click. If the corresponding GDB is later dropped (see below) or becomes corrupted, the archive can be readily restored using the ‘Restore from Archive’ button. GDB archives also facilitate sharing data with other researchers, who can use the ‘Restore from Archive’ function to load any archive to their own VM. In addition, all GDB can be archived together using the ‘Archive All’ function. Any ‘Current’ xGDBvm database can be discarded using the ‘Drop’ button. This removes all GDB-associated directories and their output data, but preserves the GDB ID and its stored configuration data, allowing users to build on the previous configuration or restore (see above) a GDB. Finally, the most recently added GDB can be deleted using ‘Delete’, or all GDB can be deleted using ‘Delete All’.

Reannotating with yrGATE

A key feature of xGDBvm is the ability to flag low-quality gene structures and improve them in place by manual re-annotation. For each genome displayed on xGDBvm, the Gene Models page provides filters to select high coverage / low integrity models (based on GAEVAL quality score and coverage) that might be improved by manual inspection (Fig. 6A). Users can create an annotation login account and correct, confirm, or disqualify any gene prediction using the yrGATE annotation tool (Wilkerson et al., 2006); see Fig. 6B. The yrGATE tool offers point-and-click simplicity for building a gene structure, enhanced by dynamic reporting of GAEVAL scores to guide the user to the best possible model based on evidence alignments. yrGATE includes curation tools for users who are assigned Administrator status, providing a quality check for submitted annotations prior to their display. All re-annotation and curation steps are carried out in a single browser window with portals to NCBI BLAST and other analysis tools, and users can manage their own annotations (save, submit for curation, delete) on the Community Central pages. Administrative features include the ability to assign users to annotation working groups, track annotation totals for each user, and configure one or more email addresses for administrative notification. Once curated, yrGATE annotations are displayed as a separate track in the xGDBvm genome browser with color-coding to indicate re-annotation class (Fig. 6B), and these can be downloaded in GFF3 or FASTA format.

Benchmarking xGDBvm

Whole genome annotation. Capsella rubella is an Arabidopsis relative with a sequenced genome totaling 134.8 Mb (Slotte et al., 2013). We evaluated xGDBvm as a tool for new genome annotation using the Capsella rubella genome assembly (see Methods for sequence sources and parameters).
We obtained both Arabidopsis thaliana cDNA sequences and A. thaliana predicted proteins as input for evidence alignments. We first computed high-quality transcript and protein spliced alignments using the ‘standalone’ HPC job submission tool in an xGDBvm instance at iPlant. The GeneSeqer-MPI job (8 processors with 64 threads) and GenomeThreader job (2 processors with 12 threads) finished in 7 hr and 1 hr, respectively. These outputs were used as input for an annotation workflow (with CpGAT option selected) in xGDBvm. The CpGAT reference dataset was the entire set of UniRef90 Viridiplantae proteins (see Methods). In addition, the C. rubella annotation dataset (in GFF3 format) was uploaded to xGDBvm for comparison. The annotation of 873 scaffolds was completed in approx. 12 days on a single core processor VM with 4 GB RAM. The results are shown in Table 2. xGDBvm completed 49,947 cDNA spliced alignments and 28,595 protein spliced alignments. The CpGAT annotation generated 25,498 gene models, compared to 28,447 gene models from the published C. rubella annotation. A total of 4,368 loci from the published annotation had no match in the CpGAT set (as determined by overlap), while 861 loci were unique to CpGAT. Comparison of 19,892 loci with gene models from both CpGAT and the published annotation using ParsEval (Standage and Brendel, 2012) revealed a high level of congruence between the two data sets. More than 60% of the gene models compared had identical coding sequences. At the level of individual exons, the sensitivity (true positive rate) was 69% and the specificity (true negative rate) was 68%, or 89% and 88% respectively if restricted to coding exons. At the level of individual nucleotides, the sensitivity and specificity were 97% and 96%, respectively.
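The overlap criterion used above to count loci present in only one annotation set can be sketched as follows. This is a minimal illustration, not the method implemented in ParsEval or xGDBvm: the `unmatched_loci` function and the `Locus` tuple layout are assumptions made for the example, strand is ignored, and any overlap of at least one base counts as a match.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed locus representation: (scaffold, start, end), 1-based inclusive.
Locus = Tuple[str, int, int]


def unmatched_loci(query: Iterable[Locus],
                   reference: Iterable[Locus]) -> List[Locus]:
    """Return query loci that overlap no reference locus on the same scaffold."""
    by_scaffold: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for scaffold, start, end in reference:
        by_scaffold[scaffold].append((start, end))

    result: List[Locus] = []
    for scaffold, start, end in query:
        # Two closed intervals overlap when each starts before the other ends.
        hit = any(start <= r_end and r_start <= end
                  for r_start, r_end in by_scaffold.get(scaffold, []))
        if not hit:
            result.append((scaffold, start, end))
    return result
```

Calling the function once in each direction gives the two counts of interest: loci in the first set with no counterpart in the second, and loci unique to the second set. The pairwise scan is quadratic per scaffold, which is adequate for a sketch; sorted intervals or an interval tree would be the usual choice at genome scale.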
Together, these benchmark comparisons demonstrate the reliability of CpGAT as a workflow for producing a provisional genome annotation (our purpose is not to present a detailed comparison of these two annotations; the respective evidence alignment datasets and thresholds were likely not identical, making such detailed analysis complex).

Re-annotation of low quality predictions. We evaluated GAEVAL gene quality for the Capsella rubella annotation dataset on a locus basis by setting a locus table filter for average integrity < 75% and coverage > 75%. This filter resulted in 254 questionable loci with likely annotation errors for CpGAT models compared to 558 questionable models in the published annotation set (Table 2). This subset represents models for which re-annotation has a high probability of improving gene prediction via the yrGATE tool. We chose an example of a locus from the published annotation that was flagged by GAEVAL as possibly erroneous, Carubv1011418m.g (Fig. 6). The CpGAT annotation for this region was split into two distinct, complete gene structures, identified as scaffold_1.g5.t1 and scaffold_1.g6.t1. Using the yrGATE tool, we confirmed the CpGAT models as more accurately representing the evidence alignments (dark and light green tracks in Fig. 6B).

Genome region. Another use for xGDBvm is to annotate a genome segment containing a specific gene or region of interest. This would typically be a rapid turnaround analysis compared to whole genome analysis and thus could be carried out using internal computing resources, possibly repeatedly under different parameter regimes. As an example, we used a Setaria italica predicted protein, annotated as ‘stem-specific protein TSJT1-like’, as a tBLASTn query against the Musa acuminata subsp. malaccensis whole genome sequence data in GenBank. We retrieved a contig (839) that contained a region of high similarity to this sequence (see Methods).
We then configured xGDBvm inputs consisting of Musa genomic contig 839, the current Musa acuminata EST dataset from GenBank, and the predicted protein translations from the annotated genome of a related monocotyledonous plant species Brachypodium distachyon (http://www.brachypodium.org). The workflow included gene prediction using CpGAT with UniRef90 proteins from Viridiplantae as a reference dataset (see Methods). The CpGAT output included 4 evidence-based loci and 12 ab initio predicted genes, including a model fully supported by transcript alignment in the region with high similarity to XP_004977556 (Supplemental Fig. 5).

xGDBvm implementation

iPlant. xGDBvm has been deployed as a public image on iPlant’s Atmosphere Cloud Service (https://atmo.iplantcollaborative.org/application). Researchers can launch an xGDBvm instance and explore it once they have obtained an iPlant user account (https://user.iplantcollaborative.org/register/) using an institutional email address. An iPlant account also grants the user a home page on iPlant’s Data Store. Step-by-step instructions for setting up xGDBvm, available on the Wiki at http://goblinx.soic.indiana.edu/wiki/doku.php?id=user_instructions, can be summarized as follows: 1) In the Atmosphere Control Panel, find the latest xGDBvm image, launch an instance, and attach an external block storage volume using drag-and-drop; 2) Access the instance’s secure shell using iPlant credentials and type simple commands to update xGDBvm code, set a Web password, initialize IRODS/FUSE, mount external storage, and launch a configuration script; 3) Access the VM’s GUI via HTTPS or VPN and follow instructions there to configure/create a genome annotation.

Indiana University. xGDBvm has also been implemented on a ‘production’ virtual server at Indiana University, serving as a host for PdomGDB (http://goblinx.soic.indiana.edu/PdomGDB), a genome database for Polistes dominula (European paper wasp), as well as the test datasets described here (see Data Access). PdomGDB provides a showcase for the xGDBvm platform, including the addition of extra nonstandard feature tracks created using methods outlined in the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php?id=configure_new_track). PdomGDB is actively being updated by the Polistes research community using the yrGATE tool for contributing expert-curated gene annotations, as described in this manuscript (accepted submissions are accessible at http://goblinx.soic.indiana.edu/yrGATE/GDB001/CommunityCentral.pl). This website also includes general information on the xGDBvm project on the project home page (http://goblinx.soic.indiana.edu/index.php).

Public repository. The xGDBvm project maintains a presence at http://brendelgroup.github.io/xGDBvm/. The xGDBvm-specific software can be accessed and updated from https://github.com/BrendelGroup/xGDBvm, where developers can contribute via git pull requests, and users can screen pending issues and report new ones. xGDBvm is licensed under the GNU General Public License, version 3. The repository includes case studies that illustrate real-world projects implemented using xGDBvm (https://github.com/BrendelGroup/xGDBvm/tree/master/case-studies/).

DISCUSSION

xGDBvm’s utility

As an all-in-one solution to genome annotation and analysis, xGDBvm is unique among currently available packages.
Configured as a virtual server with a complete GUI and HPC capabilities, xGDBvm removes barriers to entry imposed by extensive software installation, testing and troubleshooting, and command-line operation. The xGDBvm GUI guides inexperienced users by presenting only actionable choices and instructions at each step, as well as providing pre-installed sample datasets, input data validation, error flagging, and extensive help pop-ups. Data management is handled entirely within the xGDBvm environment, allowing the user to focus on the overall annotation task rather than managing intermediate input/output files. The resulting Web site can be either public or password-protected as desired, and the contents can be archived, shared, or exported for display using other genome display platforms. We expect that this combination of features will make xGDBvm attractive to research groups with a desire to annotate genome data but limited access to informatics support.

There are several use cases for xGDBvm in its current implementation at iPlant:

1) Researchers with a newly assembled genome who can quickly align relevant transcript assembly and/or protein data to determine probable gene location and then perform gene structure computation on either a portion of the genome or the genome in its entirety, resulting in a “first pass” genome annotation.

2) Researchers with a recently annotated genome who wish to share it and improve annotation quality via community annotation.

3) Researchers who wish to create their own copy of a ‘finished’ genome annotation in order to run gene quality analyses with up-to-date transcript data, and/or carry out targeted or general re-annotation.

4) Instructors desiring a hands-on environment for exploring the principles of genome annotation with real data and access to HPC resources.
In scope, xGDBvm provides an easy-to-use and versatile platform for annotating and analyzing genomes at various stages of completion. At one extreme, a finished genome can be loaded from data files available online, giving the user complete freedom to analyze and re-annotate genes previously published. At the opposite extreme, a newly assembled genome can be loaded together with related-species data and/or short read assemblies, and CpGAT can be invoked to automatically build a credible draft genome annotation for further analysis. With any implementation, the powerful built-in tools for gene quality analysis and re-annotation make xGDBvm a valuable asset for improving genome structure annotation as well.

Another advantage of xGDBvm is its flexibility, as it allows multiple genome views to be created in one instance and supports updates to any type of existing data. Finally, xGDBvm provides extensive documentation of the annotation and update process, important both for troubleshooting and for reporting results.

Comparison to similar tools

Other cloud-based annotation tools are available: Maker (http://www.yandell-lab.org/software/maker.html) is a eukaryotic genome annotation pipeline that can be installed in a variety of server environments (Cantarel et al., 2008), and a version of Maker (Maker-P) is installed at iPlant Atmosphere as a virtual machine with links to HPC (https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+at+iPlant). The web-based genome analysis platform Galaxy (Goecks et al., 2010) offers cloud installation via Amazon’s Elastic Compute Cloud (EC2) service (https://aws.amazon.com/ec2/). xGDBvm differs from these tools in that it offers a comprehensive package combining a structured environment for data inputs, automated data processing with sanity checks, and tools for genome display, search and re-annotation built in.

Limitations

As currently configured, xGDBvm is unable to map short read data onto a genome, so users will need to assemble short reads de novo prior to submitting data to xGDBvm as a TSA dataset. xGDBvm’s computational workflow can currently accommodate only one track per spliced alignment data type (EST, cDNA, TSA, Protein), and two tracks for gene model predictions. Users who require additional tracks must configure them manually. xGDBvm’s HPC processes are currently limited to spliced alignment computations, whereas gene structure annotation via CpGAT is limited by the processing power of the VM.

VM availability and usage at iPlant, as well as access to HPC resources, can be expected to be limited based on overall capacity and the amount of demand on the respective systems. Users wishing to increase their usage quotas may be required to justify their request.

Future directions

xGDBvm is still being developed and improved. The road map includes additional features such as modular data workflows allowing unlimited track numbers, and additional options for gene annotation and evaluation. xGDBvm’s implementation of the Agave API should facilitate the addition of new standalone or pipeline-integrated computation tools that can take advantage of high performance processing (e.g. Maker). We also envision integrating xGDBvm with other analysis platforms including one that allows visualization of common introns (Wilkerson et al., 2006).

METHODS

xGDBvm architecture and software

The xGDBvm architecture is shown in Fig. 3, and a more detailed description can be found in the wiki (http://goblinx.soic.indiana.edu/wiki/).
We currently maintain two parallel implementations of xGDBvm, one at Indiana University (xGDBvm-GoblinX) on a virtual server using Red Hat Enterprise Linux (http://www.redhat.com), and the other on the iPlant Atmosphere platform (xGDBvm-iPlant) using CentOS Linux (https://www.centos.org). Both implementations run Apache web server software (http://www.apache.org) with very similar configurations, but xGDBvm-iPlant also includes openSSL (https://www.openssl.org) and Apache’s mod_ssl for secure access over HTTPS. Additional software includes MySQL client/server (https://www.mysql.com), Perl (http://www.perl.org) and PHP (http://php.net/) to handle web scripts and some server-side functions, with additional Perl modules for cgi and session management. Installed Javascript libraries include JQuery and JQuery UI (https://jquery.com). BioPerl (http://www.bioperl.org/wiki/Main_Page) and EMBOSS (http://emboss.sourceforge.net) were installed to handle certain operations. Additional binaries, including NCBI-BLAST+ (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/+) as well as the computation-related software described earlier, were installed under /usr/local/bin/ or /usr/local/src/ (see Supplemental Table 1 for a complete list of installed binaries).

The document root directory is /xGDBvm/ under the VM’s root partition. xGDB scripts (modified from Schlueter et al. (2006)), PHP scripts, and other assets (Javascript files, css files and images) were installed under /xGDBvm/XGDB/, and administrative scripts under /xGDBvm/admin/. Workflow-related shell scripts are found under /xGDBvm/scripts/, and custom yrGATE, GAEVAL and CpGAT packages were installed under /xGDBvm/src/. The entire document root contents (excluding binaries) are maintained as a public repository at GitHub (https://github.com/BrendelGroup/xGDBvm).

The xGDBvm architecture is designed to segregate input data, dynamically generated output data, and static web scripts that comprise the xGDBvm core (see Fig. 3). The user’s Data Store directory (for inputs, segregated under a common subdirectory xgdbvm/) and block storage volume (for outputs) are mounted under /home/xgdb-input/ and /home/xgb-data/, respectively. These are symbolically linked to paths under the document root (/xGDBvm/input and /xGDBvm/data), and all xGDBvm scripts reference these data paths for reading and writing data. Data destination directories are assigned ownership by group ‘xgdb’ with read-write privileges, and the ‘apache’ user is added to the ‘xgdb’ group under /etc/group. Temporary data are saved to /xGDBvm/data/tmp.

To provide secure transactions where passwords are being sent over the Web, xGDBvm-iPlant enforces HTTPS (with self-signed cert) on all pages. Website password protection via .htaccess is required upon initial configuration, so only users who have the password can view the website online. Password protection can also be modified using the xGDBvm Admin GUI to include just the Manage functions (Admin, Configure/Create and Remote Jobs); in this configuration, the VM’s genome browsers and data download sections are public. The back-end MySQL password can also be customized via the GUI for additional site security. Web access to the mounted storage directories is blocked by the Apache configuration, so the user’s mounted disks are not exposed on the Internet. Certain VM assets (OAuth2 credentials, MySQL password) are stored under /xGDBvm/admin/, which is protected via the Apache configuration.
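The storage layout described in this section, in which scripts read and write only through fixed paths under the document root while the data physically reside on separately mounted storage, can be sketched as follows. This is an illustration under stated assumptions rather than xGDBvm's own setup script: the `link_storage` and `demo` functions and the directory names used in the sandbox are hypothetical, and everything runs inside a temporary directory so that no real mount point is touched.

```python
import os
import tempfile


def link_storage(doc_root: str, input_mount: str, data_mount: str) -> None:
    """Create doc_root/input and doc_root/data as symlinks to the mounts."""
    os.makedirs(doc_root, exist_ok=True)
    for name, target in (("input", input_mount), ("data", data_mount)):
        link = os.path.join(doc_root, name)
        if not os.path.lexists(link):
            os.symlink(target, link)


def demo() -> bool:
    """Write through the document-root path; check it lands on the mount."""
    with tempfile.TemporaryDirectory() as sandbox:
        # Stand-ins for the mounted Data Store directory and output volume.
        input_mount = os.path.join(sandbox, "mounts", "input-store")
        data_mount = os.path.join(sandbox, "mounts", "output-volume")
        os.makedirs(input_mount)
        os.makedirs(data_mount)
        doc_root = os.path.join(sandbox, "docroot")
        link_storage(doc_root, input_mount, data_mount)
        # A script only ever refers to the document-root path...
        with open(os.path.join(doc_root, "data", "result.txt"), "w") as fh:
            fh.write("ok\n")
        # ...but the file physically resides on the mounted volume.
        return os.path.isfile(os.path.join(data_mount, "result.txt"))
```

One consequence of this indirection is that the output volume can be detached and mounted on another VM without changing any path used by the scripts, provided the same links are recreated there.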
Benchmarking xGDBvm

The hardmasked Capsella rubella assembly (Slotte et al., 2013) was downloaded from JGI (ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Crubella/assembly/Crubella_183_hardmasked.fa.gz; user account required). Arabidopsis thaliana cDNA FASTA sequences were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/nuccore?term=("mrna not est"[Filter]) AND Arabidopsis thaliana[Organism]). Predicted protein translations were obtained from the Arabidopsis TAIR10 genome release (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/). UniRef90 proteins from Viridiplantae were retrieved in FASTA format from UniProt (http://www.uniprot.org/uniref/?query=uniprot:(taxonomy:”Viridiplantae+[33090]”)+identity:0.9) and the file renamed as UniRef90-Viridiplantae.fa. A genome annotation based on these input data was created on an xGDBvm instance at iPlant (https://atmo.iplantcollaborative.org/application) with 2 CPUs and 4 GB RAM. xGDBvm’s GeneSeqer parameters were species model:Arabidopsis, alignment stringency:strict. CpGAT parameters were BGF:Arabidopsis, Augustus:arabidopsis, GeneMark:a_thaliana; Skip Mask=T. For comparison, the current C. rubella annotation (GFF3) was downloaded (ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Crubella/annotation/Crubella_183_gene.gff3.gz) and included as input in the genome workflow. Additional spliced alignment benchmarking and case studies used GeneSeqer-MPI and GenomeThreader running on high performance computing systems at the Texas Advanced Computing Center (TACC; https://www.tacc.utexas.edu), accessed from xGDBvm as public apps via the Agave API.

For the second use case, we queried the NCBI whole genome shotgun sequence (wgs) library for Musa acuminata subsp. malaccensis (banana; http://www.ncbi.nlm.nih.gov/assembly/GCF_000313855.1/) using tblastn (http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=tblastn) with a Setaria italica predicted protein (XP_004977556.1). Musa acuminata contig 839 (GenBank accession CAIC01023586.1) was retrieved from NCBI (http://www.ncbi.nlm.nih.gov/Traces/wgs/fdump.cgi?CAIC01,23586); the resulting file was named Musa_contig_839.gdna.fa, and the FASTA header was simplified to “>Musa_contig839”. Musa acuminata EST sequences in FASTA format were retrieved from NCBI (http://www.ncbi.nlm.nih.gov/nucest?term=Musa_acuminata%5BOrganism%5D]) and renamed as musa_est.fa. UniRef90 proteins from Viridiplantae were retrieved in FASTA format as described above. xGDBvm’s GeneSeqer parameters were species model:rice, alignment stringency:strict. CpGAT parameters were BGF:rice, Augustus:maize, GeneMark:o_sativa; Skip Mask=T.

Accession Numbers

Datasets described under Benchmarking can be viewed and downloaded from the xGDBvm project pages at http://goblinx.soic.indiana.edu/GDB002/ (Capsella rubella genome) and http://goblinx.soic.indiana.edu/GDB003/ (Musa acuminata contig 839). A list of all Web resources referenced in this manuscript is found in Supplemental Table 2.

SUPPLEMENTAL DATA

Supplemental Figure 1. Input Data Validation.
Supplemental Figure 2. The xGDBvm automated workflow.
Supplemental Figure 3. Output data validation.
Supplemental Figure 4. Preconfigured example datasets.
Supplemental Figure 5. Annotation of a single genomic contig.
Supplemental Table 1. xGDBvm Installed Software.
Supplemental Table 2. Hyperlinks referenced in the manuscript.
ACKNOWLEDGMENTS

We thank Ann Fu for help with initial development of the automated workflow, Shannon Schlueter for advice in adapting his XGDB core code for the virtual environment, James Denton for extensive debugging and yrGATE feature development, Jianqing Guan for code to calculate dynamic GAEVAL scores, and Bruce Shei for system support at Indiana University. We especially thank collaborators and colleagues at the iPlant Collaborative (CyVerse) and Texas Advanced Computing Center (TACC) for their assistance in integrating xGDBvm into the Atmosphere cloud environment and the Agave API: Roger Barthelson and Shabari Subramaniam, who wrote and tested HPC wrapper scripts for GeneSeqer-MPI and GenomeThreader, respectively; Andre Mercer, who provided prototype PHP scripts for the API; and Edwin Skidmore, Rion Dooley, and Matthew Vaughn, who provided system troubleshooting and advice. This work was supported by NSF award #1221984 to V. Brendel.

AUTHOR CONTRIBUTIONS

V.B. conceived the project and provided overall guidance; J.D. carried out the project and managed collaborations; D.S. tested xGDBvm functionality with actual datasets, configured and extended a production xGDBvm server, ran ParsEval comparisons, and contributed some parsing scripts; N.M. provided guidance for xGDBvm’s implementation at iPlant and created the prototype HPC wrapper scripts.

FIGURE LEGENDS

Figure 1. Overview of xGDBvm as implemented at CyVerse (iPlant). xGDBvm is a virtual server environment for gene structure annotation that can be cloned, configured, populated with input data, and run from a Web browser in a few steps, as summarized here: A. Log in to the CyVerse Atmosphere Control Panel (https://atmo.iplantcollaborative.org/application) (1) and click to create a new instance (cloned copy) of xGDBvm (2), create a block storage volume for output data, and attach it to the instance (3). Open a Web shell interface (4), accessible from the Control Panel, and type a series of commands to set up and configure the new xGDBvm instance, also mounting the Data Store and the attached volume. B. Log in to the CyVerse Data Store cloud storage system (https://de.iplantcollaborative.org/de/) and upload input data files to an input data directory (accessible to the VM) using a batch uploading tool. Naming conventions are used to identify each input type. C. Log in to the xGDBvm instance’s Graphical User Interface (GUI) using HTTPS via its unique IP address or using a Virtual Network Client (VNC) (1). All subsequent steps are carried out using the xGDBvm GUI. Authorize the VM to connect to remote HPC resources via the Agave API (http://agaveapi.co) (2). Configure the path to Data Store inputs and set other parameters including remote job execution (optional). xGDBvm will validate files, return expected outputs and flag any input file errors (3). Initiate automated workflows and monitor progress (4). The workflow sends some data remotely for processing on High Performance Computing (HPC) resources (https://www.xsede.org/) managed by Agave APIs, and processes other files locally using the attached volume as a scratch disk. The xGDBvm workflow waits for HPC outputs, then proceeds with the annotation process. Output data are written to the external volume and can be accessed from the xGDBvm Web browser as GDB001, GDB002, etc. (5). In addition to a fully featured genome browser, xGDBvm includes tools to query, update, reannotate, download, or archive outputs to the user’s Data Store. For details, refer to the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php).

Figure 2. Data process schema. Input data types (with standardized names as indicated), computational modules, and outputs are shown. Images are screenshots of color-coded track glyph types (gene models; splice alignments) and track flags (quality scores) displayed in the xGDBvm genome browser.

Figure 3. xGDBvm architecture. An xGDBvm VM instance, as hosted on the CyVerse Atmosphere cloud infrastructure (https://atmo.iplantcollaborative.org/application), has separate file system partitions under root (containing the xGDBvm Web GUI, scripts, binaries, and other software) and /home/ (which is configured with mount points for the user’s Data Store home directory for data input and a block storage volume for data output). The Agave API, hosted by the CyVerse Discovery Environment, is used for authentication of the VM via OAuth2 and for management of High Performance Computing applications and job submission. A key feature of xGDBvm is the ability to attach and mount the output volume to a different VM and reconstitute the annotation outputs and display. See text for details.

Figure 4. xGDBvm data management. A. Screenshot of the GDB Configuration page, set up for processing Example data. Each genome annotation is assigned a unique identifier (GDB001, GDB002, etc.) and a user-provided name. In addition to form fields for input data path, annotation parameters, and metadata, this page provides extensive color-coded information about all system settings (e.g. license keys, storage capacity, login status, displayed in blue-green), input data validity (light green), and expected output (orange). The form includes buttons that launch modal windows to initiate computational workflow or edit configuration. B.
Screenshot of Archive/Delete menu, showing genome databases with ‘Current’ (blue; computation complete) or ‘Development’ (grey; not yet run) status. Genome annotations are identified as GDB001, GDB002, etc. Each table row displays information about a GDB, including time stamps, as well as action buttons that allow the user to Drop, Delete, Archive, Delete Archive, or Copy the database (see text for details). Global action buttons (top right) allow the user to Delete or Archive all data on the VM. C. Screenshot of List All Jobs page with tools to monitor and manage remote HPC jobs. The page displays IDs, job metadata, time stamps, color-coded status indicators, and action buttons to manage output (Stop Job, Delete Job, View Logs, Copy Output) via the Agave API; see text for details.

Figure 5. Genome context view. Shown is a typical region from the Capsella rubella genome annotation described under Results. The genome span is shown in yellow, and genome features (tracks) are labeled to the left of and above each track; tracks can be reordered by drag-and-drop or hidden (“hide track”). The top bar provides search and navigation controls; the left bar contains links to tools and views, as well as to configuration and help pages. The Region submenu (orange) contains zoom/scroll controls, region-specific tools, and formatting controls. See Table 1 for details of xGDBvm tools and features.

Figure 6. Gene model improvement using yrGATE. A. A published gene model from Capsella rubella (Carubv1011418m.g) showing high coverage/low integrity in the Locus Table (upper table, highlighted columns). B. Corresponding gene model in genome context view (blue glyph). CpGAT annotated this region as two distinct loci (magenta glyph), supported by both Arabidopsis protein (black) and cDNA (light blue) evidence. The region was then re-annotated using yrGATE (dark and light green glyphs) to confirm the most probable genic structure of this region based on available evidence.
yrGATE glyphs are color-coded according to the type assigned by the annotator, e.g. dark green (improved structure); light green (new structure not previously annotated).

TABLES

Table 1. xGDBvm features

Sections¹: Administrate; Manage; View GDB; View GDB: Feature Tracks; View GDB: Tools (Genome Context View); Annotate; Help

Feature | Functions
Administrate | Modify password protection; customize site name; administer yrGATE user accounts
Create/Configure | Configure new GDB; validate input files; view/edit configuration; initiate, monitor automated workflows; view log files; archive/restore/delete GDB, copy archive to Data Store
Remote Jobs | Configure OAuth2 login, job APIs, App IDs; submit jobs; view job status; manage jobs (CyVerse login required)
GDB Home Page | GDB summary data; view genome region or search for sequence
Genome Context View | View all tracks by genome segment and region; zoom, jump up or downstream, view nucleotide-level alignments
Gene Predictions (Loci) | All annotated loci and metadata in tabular view; search/filter queries; yrGATE summaries for each locus; download as .csv
Aligned Proteins, Aligned Transcripts | All spliced alignments in tabular views; search/filter queries; download as .csv
GAEVAL scores | Detailed gene quality scores for each Gene Prediction track; search/filter queries
Download region | Download any sequence type from region as FASTA; download annotations from region in GFF3 or NCBI format
Download data | Download individual input files, output files (all types), or GDB archive files to the local drive
Search ID or keyword | Search and retrieve FASTA sequence or subsequence (introns, exons, up/downstream) for any feature displayed on GDB
Blast GDB | Match sequence within GDB
Blast All GDB | Match sequence across multiple GDB
CpGAT annotate region | Regional gene predictions and quality scores
Add Custom Track | Add custom track from local GFF3 file
GenomeThreader region | Regional spliced alignment of proteins
yrGATE | Tool for creating/submitting user-contributed annotations; with portals to NCBI ORF finder; NCBI BLAST; GENSCAN; GeneMark; CpGAT
Community Central | Searchable list of curated yrGATE (user-submitted) annotations; download annotations (FASTA, GFF3)
My Annotations | Manage User Annotations (admin account & login required)
My Groups | View Group Annotations (admin account & login required)
My Admin | Curate user-submitted Annotations (admin account & login required)
Help pages | User instructions and video tutorials; also available as contextual help popups
xGDBvm Wiki (external) | Documentation and instructions for users/admins/developers
Github Repository (external) | Source code; Issue tracking; Case studies

¹ As implemented on iPlant (CyVerse) Atmosphere cloud service

Table 2. Annotation of the Capsella rubella genome¹

Genome segments | 853
Total length (bp) | 134,834,574
A. thaliana cDNA spliced alignments, Total | 49,947
A. thaliana cDNA spliced alignments, Cognate³ | 44,870
A. thaliana protein spliced alignments | 34,629
CpGAT gene predictions, Transcripts | 25,498
CpGAT gene predictions, Loci | 22,698
CpGAT gene predictions, Questionable⁴ | 254
Published gene predictions², Transcripts | 28,447
Published gene predictions², Loci | 26,521
Published gene predictions², Questionable⁴ | 558

¹ See also http://goblinx.soic.indiana.edu/GDB002/ for data display and download
² Source: ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Crubella/annotation/Crubella_183_gene.gff3.gz
³ The single location with the best alignment score for a given query sequence
⁴ Less than 75% integrity score and greater than 75% coverage based on GAEVAL analysis (see Methods)

Parsed Citations
Abouelhoda, M.I., Kurtz, S., and Ohlebusch, E. (2002). The enhanced suffix array and its applications to genome analysis. In Second Workshop on Algorithms in Bioinformatics (Springer-Verlag), pp. 449-463.
Borodovsky, M., Mills, R., Besemer, J., and Lomsadze, A. (2003).
Prokaryotic gene prediction using GeneMark and GeneMark.hmm. In Current Protocols in Bioinformatics, Chapter 4, Unit 4.5, A.D. Baxevanis, ed (
Cantarel, B.L., Korf, I., Robb, S.M., Parra, G., Ross, E., Moore, B., Holt, C., Sanchez Alvarado, A., and Yandell, M. (2008). MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18, 188-196.
Dooley, R., Vaughn, M., Stanzione, D., Terry, S., and Skidmore, E. (2012). Software-as-a-Service: The iPlant Foundation API. In 5th IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) (IEEE).
Foissac, S., Gouzy, J.P., Rombauts, S., Mathé, C., Amselem, J., Sterck, L., Van de Peer, Y., Rouzé, P., and Schiex, T. (2008). Genome annotation in plants and fungi: EuGene as a model platform. Current Bioinformatics 3, 87-97.
Goecks, J., Nekrutenko, A., and Taylor, J. (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11, R86.
Goff, S., Vaughn, M., McKay, S., Lyons, E., Stapleton, A., Gessler, D., Matasci, N., Wang, L., Hanlon, M., Lenards, A., Muir, A., Merchant, N., et al. (2011). The iPlant Collaborative: Cyberinfrastructure for Plant Biology. Front Plant Sci 2.
Gremme, G., Brendel, V., Sparks, M.E., and Kurtz, S. (2005).
Engineering a software tool for gene structure prediction in higher organisms. Information and Software Technology 47, 965-978.
Grigoriev, I.V., Nordberg, H., Shabalov, I., Aerts, A., Cantor, M., Goodstein, D., Kuo, A., Minovitsky, S., Nikitin, R., Ohm, R.A., Otillar, R., Poliakov, A., Ratnere, I., Riley, R., Smirnova, T., Rokhsar, D., and Dubchak, I. (2012). The genome portal of the Department of Energy Joint Genome Institute. Nucleic Acids Res 40, D26-32.
Haas, B., Salzberg, S., Zhu, W., Pertea, M., Allen, J., Orvis, J., White, O., Buell, C.R., and Wortman, J. (2008). Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol 9, R7.
Haas, B., Delcher, A., Mount, S., Wortman, J., Smith, R., Hannick, L., Maiti, R., Ronning, C., Rusch, D., Town, C., Salzberg, S., and White, O. (2003). Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31, 5654-5666.
Hammesfahr, B., Odronitz, F., Hellkamp, M., and Kollmar, M. (2011). diArk 2.0 provides detailed analyses of the ever increasing eukaryotic genome sequencing data. BMC Res Notes 4, 338.
Hoff, K.J., Lange, S., Lomsadze, A., Borodovsky, M., and Stanke, M. (2016). BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32, 767-769.
Holt, C., and Yandell, M. (2011). MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491.
Leroy, P., Guilhot, N., Sakai, H., Bernard, A., Choulet, F., Theil, S., Reboux, S., Amano, N., Flutre, T., Pelegrin, C., Ohyanagi, H., Seidel, M., Giacomoni, F., Reichstadt, M., Alaux, M., Gicquello, E., Legeai, F., Cerutti, L., Numa, H., Tanaka, T., Mayer, K., Itoh, T., Quesneville, H., and Feuillet, C. (2012). TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes. Front Plant Sci 3, 5.
Mungall, C.J., Misra, S., Berman, B.P., Carlson, J., Frise, E., Harris, N., Marshall, B., Shu, S., Kaminker, J.S., Prochnik, S.E., Smith, C.D., Smith, E., Tupy, J.L., Wiel, C., Rubin, G.M., and Lewis, S.E. (2002). An integrated computational pipeline and database to support whole-genome sequence annotation. Genome Biol 3, research0081-0081.0011.
Nocq, J., Celton, M., Gendron, P., Lemieux, S., and Wilhelm, B.T. (2013). Harnessing virtual machines to simplify next-generation DNA sequencing analysis. Bioinformatics 29, 2075-2083.
Potter, S.C., Clarke, L., Curwen, V., Keenan, S., Mongin, E., Searle, S.M., Stabenau, A., Storey, R., and Clamp, M. (2004). The Ensembl analysis pipeline. Genome Res 14, 934-941.
Reddy, T.B.K., Thomas, A.D., Stamatis, D., Bertsch, J., Isbandi, M., Jansson, J., Mallajosyula, J., Pagani, I., Lobos, E.A., and Kyrpides, N.C. (2015). The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res 43, D1099-1106.
Schlueter, S.D., Wilkerson, M.D., Dong, Q., and Brendel, V. (2006). xGDB: open-source computational infrastructure for the integrated evaluation and analysis of genome features. Genome Biol 7, R111.
Schlueter, S.D., Wilkerson, M.D., Huala, E., Rhee, S.Y., and Brendel, V. (2005). Community-based gene structure annotation. Trends Plant Sci 10, 9-14.
Slotte, T., Hazzouri, K.M., Agren, J.A., Koenig, D., Maumus, F., Guo, Y.L., Steige, K., Platts, A.E., Escobar, J.S., Newman, L.K., Wang, W., Mandakova, T., Vello, E., Smith, L.M., Henz, S.R., Steffen, J., Takuno, S., Brandvain, Y., Coop, G., Andolfatto, P., Hu, T.T., Blanchette, M., Clark, R.M., Quesneville, H., Nordborg, M., Gaut, B.S., Lysak, M.A., Jenkins, J., Grimwood, J., Chapman, J., Prochnik, S., Shu, S., Rokhsar, D., Schmutz, J., Weigel, D., and Wright, S.I. (2013). The Capsella rubella genome and the genomic consequences of rapid mating system evolution. Nat Genet 45, 831-835.
Specht, M., Stanke, M., Terashima, M., Naumann-Busch, B., Janssen, I., Hohner, R., Hom, E.F., Liang, C., and Hippler, M. (2011).
Concerted action of the new Genomic Peptide Finder and AUGUSTUS allows for automated proteogenomic annotation of the Chlamydomonas reinhardtii genome. Proteomics 11, 1814-1823.
Standage, D., and Brendel, V. (2012). ParsEval: parallel comparison and analysis of gene structure annotations. BMC Bioinformatics 13, 187.
Stanke, M., Keller, O., Gunduz, I., Hayes, A., Waack, S., and Morgenstern, B. (2006). AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 34, W435-439.
Thibaud-Nissen, F., Souvorov, A., Murphy, T., DiCuccio, M., and Kitts, P. (2013). GNOMON (Eukaryotic Genome Annotation Pipeline). In The NCBI Handbook [Internet]. 2nd edition. (Bethesda (MD): National Center for Biotechnology Information (US)), http://www.ncbi.nlm.nih.gov/books/NBK169439/.
Uberbacher, E.C., Hyatt, D., and Shah, M. (2004). GrailEXP and Genome Analysis Pipeline for genome annotation. In Current Protocols in Human Genetics, J.L. Haines, ed, Unit 6.5.
Usuka, J., Zhu, W., and Brendel, V. (2000). Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics 16, 203-211.
Wang, B.B., O'Toole, M., Brendel, V., and Young, N.D. (2008). Cross-species EST alignments reveal novel and conserved alternative splicing events in legumes. BMC Plant Biol 8, 17.
Wilkerson, M.D., Schlueter, S.D., and Brendel, V. (2006). yrGATE: a web-based gene-structure annotation tool for the identification and dissemination of eukaryotic genes. Genome Biol 7, R58.
Yandell, M., and Ence, D. (2012). A beginner's guide to eukaryotic genome annotation. Nature Reviews Genetics 13, 329-342.

xGDBvm: A Web GUI-driven workflow for annotating eukaryotic genomes in the cloud. Jon Duvick, Daniel S Standage, Nirav Merchant and Volker P Brendel. Plant Cell; originally published online March 28, 2016; DOI 10.1105/tpc.15.00933
Supplemental Data: /content/suppl/2016/03/28/tpc.15.00933.DC1.html