xGDBvm: A Web GUI-driven workflow for annotating

Plant Cell Advance Publication. Published on March 28, 2016, doi:10.1105/tpc.15.00933
LARGE-SCALE BIOLOGY ARTICLE
xGDBvm: A Web GUI-driven workflow for annotating eukaryotic genomes in the cloud
Jon Duvick1
Daniel S. Standage2
Nirav Merchant3
Volker P. Brendel4,5
Author Affiliations
1. Department of Genetics, Development and Cell Biology, Iowa State University, Ames, Iowa 50011 USA
2. Department of Biology, Indiana University, Bloomington, Indiana 47405 USA
3. Bio Computing Facility, University of Arizona, Tucson, Arizona 85721 USA
4. Department of Biology and School of Informatics & Computing, Indiana University, Bloomington, Indiana 47405 USA
5. Corresponding author. E-mail: [email protected]
The author responsible for distribution of materials integral to the findings
presented in this article in accordance with the policy described in the Instructions
for Authors (www.plantcell.org) is: Volker P. Brendel.
Synopsis: xGDBvm is a novel tool for scalable, reproducible, and expandable genome annotation via Web-based interfaces that seamlessly integrate background cloud-based data storage and high-performance computer resources.
©2016 American Society of Plant Biologists. All Rights Reserved.
ABSTRACT
Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today’s pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface (GUI) to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant’s Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be re-annotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load pre-computed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, re-annotation, comparison of different annotations, and training or teaching.
INTRODUCTION
The number of sequenced eukaryotic genomes is increasing rapidly due to advances in sequencing technology and cost-effectiveness; for recent lists see https://gold.jgi.doe.gov (Reddy et al., 2015) and http://www.diark.org/diark (Hammesfahr et al., 2011). However, the pace of data acquisition leads to bottlenecks at both assembly and annotation stages, before the sequence data can be consumed for research. In particular, annotating a novel genome is often challenging due to our incomplete knowledge of what constitutes a gene across a wide range of species, meaning that ab initio gene prediction, although useful, is inadequate (Yandell and Ence, 2012). Full genome annotation typically consists of at minimum: 1) optionally repeat masking the genome; 2) splice-aligning transcripts and proteins from related species for evidence-based gene structure prediction; 3) using ab initio gene finding algorithms to annotate possible gene structures; 4) combining the above data sources to create a set of possible gene structures; and 5) filtering the results through quality and/or similarity filters to find the most probable set of structures that represent full-length or near-full-length coding genes. As a result, genome annotation is necessarily a time-consuming and computationally intensive process that combines numerous types of sequence analysis and heuristic prediction, typically relying on well-annotated genomes as a reference and typically resulting in a far-from-perfect (but arguably useful) draft annotation. A number of groups have published complete computational pipelines for eukaryotic genome annotation (Mungall et al., 2002; Potter et al., 2004; Uberbacher et al., 2004; Cantarel et al., 2008; Foissac et al., 2008; Holt and Yandell, 2011; Specht et al., 2011; Grigoriev et al., 2012; Leroy et al., 2012; Thibaud-Nissen et al., 2013; Hoff et al., 2016). However, these pipelines require considerable expertise to install, configure, troubleshoot, and manage. We propose that a ‘turnkey’ genome annotation system could greatly benefit researchers who desire a credible draft genome annotation to facilitate further research, as well as foster comparative genomics, as early as possible in the life of their project. Among the desirable attributes of such a system would be the following:
Easy to configure. An annotation workflow will necessarily combine a wide range of computational tools whose successful configuration and interoperability would be challenging for the non-specialist, so ideally it should be available as a precompiled package. A common method for packaging and distributing such a complex system is via a virtual machine or VM, which encapsulates the underlying server operating system, the application software components along with all requisite software dependencies, and configuration settings, all of which are stored (“imaged”) in such a way that they can be copied and launched by means of commonly available virtualization tools and made available to anyone with access to virtual server software such as KVM (http://www.linux-kvm.org) or VirtualBox (https://www.virtualbox.org). VMs have a number of advantages for complex informatics analysis (Nocq et al., 2013), of which the pre-installation of all required software for complex tasks as well as temporary access to all the computer resources needed for completion of the task are of most practical value for a typical biologist user. Cloud computing platforms such as OpenStack (https://www.openstack.org) and Docker (https://www.docker.com) offer VM and container-based technologies that can be managed, accessed remotely, and readily deployed on commercial cloud-based services such as Amazon Web Services (https://aws.amazon.com). Government-funded consortia such as the iPlant Collaborative (now CyVerse) (Goff et al., 2011) make such virtual platforms readily accessible to individual users via the internet.
Easy to use. Although most genome researchers are familiar with a wide range of online tools to evaluate sequence data, they will not necessarily know how to put them together and configure them appropriately. Ideally, an annotation platform should have a cohesive GUI that guides the user through setup, configuration, parameter setting and status reporting. Importantly, all setup and processing steps should be managed with data sanity checks (for completeness and format), context-dependent menus, error logging and reporting, and help documentation/tutorials.
Editable. The ability to edit and improve automated annotation should be built in. This means the ability both to add additional data once the workflow has completed and to modify individual annotations in such a way that the most critical regions of the genome are well annotated.
Reproducible. With variable parameters and source datasets, automated documentation and simple archiving are essential for ensuring repeatability of the genome annotation process.
Scalable. With large genomes and large transcript datasets, computations such as spliced alignment can take days or weeks on a typical lab computer, whereas with access to HPC resources the process can be completed in a few hours. Many research facilities have such resources, but using them is complex, and they are not necessarily available to every researcher who might be interested.
Publishable. Once computation is complete, the annotated genome and its input/output files should be available online either to a select community (with password access) or to the research community as a whole, thus placing output data and/or community annotation tools in the hands of the target audience in a timely manner.
With the above attributes in mind, we created a self-contained genome annotation platform, xGDBvm, for use by the research community. We report below our initial release of xGDBvm in the iPlant (CyVerse) Atmosphere cloud infrastructure (http://www.iplantcollaborative.org/ci/atmosphere) as an on-demand virtual server for genome annotation that can be adapted for a wide range of research needs.
RESULTS
Overview of xGDBvm
xGDBvm is a Linux-based platform that accepts genomic and transcript and/or protein sequence inputs and creates a genome annotation that can be displayed in the included, full-featured genome browser, with separate tracks for genome segments, transcript and protein alignments, gene predictions, and repeat masked regions (Fig. 1). xGDBvm uses a modified and extended version of the xGDB (Extensible Genome Data Broker) Web platform (Schlueter et al., 2006) written in Perl and PHP, along with a Web server, workflow automation scripts, and executables packaged together as a virtual server and configured for access over HTTP or HTTPS via a graphical user interface (GUI). xGDBvm is compact in size, occupying approximately 13 Gigabytes (GB) of a typical 20 GB VM root partition. Data inputs/outputs are preferably stored on external volumes mounted to the VM, thus alleviating constraints on VM size.
Computational processes in xGDBvm (Fig. 2) are managed by automated, user-configurable workflows, with a built-in option for calls to HPC resources. Optional masking of genome segments is carried out using Vmatch (Abouelhoda et al., 2002) based on user-provided masking libraries. Spliced alignments of transcripts and proteins to the genome are computed using GeneSeqer (Usuka et al., 2000) and GenomeThreader (Gremme et al., 2005), respectively. xGDBvm optionally creates gene model predictions using CpGAT (Comprehensive Gene Annotation Tool; http://plantgdb.org/AtGDB/cgi-bin//WebCpGAT.pl), a set of scripts and binaries that integrates spliced-alignment data and ab initio gene predictions along with BLAST similarity filters and alternative structures to derive a high-quality gene prediction dataset. The xGDBvm workflow can also upload pre-computed gene predictions from a user-provided GFF3-formatted file. All steps are logged and displayed dynamically during workflow operation. Once complete, each feature is displayed as a separate track in a fully featured genome browser complete with search/download tools and tabular feature views. A quality score assigned to each annotated locus facilitates the identification of low-quality models, which can then be re-annotated and curated using the built-in yrGATE annotation tool (Wilkerson et al., 2006). Additional genomes can be configured and created with the same VM, and the user can archive and retrieve single or global datasets. Any data type can be appended or replaced using an ‘Update’ feature. The outcome is a rich, editable environment for genome exploration and annotation, accessible locally or remotely on the Web (see Table 1 for feature overview).
Figure 1. Overview of xGDBvm as implemented at CyVerse (iPlant). xGDBvm is a virtual server environment for gene structure annotation that can be cloned, configured, populated with input data, and run from a Web browser in a few steps, summarized here: A. Log in to the CyVerse Atmosphere Control Panel (https://atmo.iplantcollaborative.org/application) (1) and click to create a new instance (cloned copy) of xGDBvm (2), create a block storage volume for output data, and attach it to the instance (3). Open a Web shell interface (4), accessible from the Control Panel, and type a series of commands to set up and configure the new xGDBvm instance, also mounting the Data Store and the attached volume. B. Log in to the CyVerse Data Store cloud storage system (https://de.iplantcollaborative.org/de/) and upload input data files to an input data directory (accessible to the VM) using a batch uploading tool. Naming conventions are used to identify each input type. C. Log in to the xGDBvm instance’s Graphical User Interface (GUI) using HTTPS via the VM’s unique IP address or using a Virtual Network Computing (VNC) client. All subsequent steps are carried out using the xGDBvm GUI. Authorize the VM to connect to remote HPC resources via the Agave API (http://agaveapi.co) (2). Configure the path to Data Store inputs and set other parameters including remote job execution (optional). xGDBvm will validate files, return expected outputs and flag any input file errors (3). Initiate automated workflows and monitor progress (4). The workflow sends some data remotely for processing on High Performance Computing (HPC) resources (https://www.xsede.org/) managed by Agave APIs, and processes other files locally using the attached volume as a scratch disk. The xGDBvm workflow waits for HPC outputs, then proceeds with the annotation process. Output data are written to the external volume and can be accessed from the xGDBvm Web browser as GDB001, GDB002, etc. (5). In addition to a fully featured genome browser, xGDBvm includes tools to query, update, reannotate, download, or archive outputs to the user’s Data Store. For details, refer to the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php).
Figure 2. Data process schema. Input data types (with standardized names as indicated), computational modules, and outputs are shown. Images are screenshots of color-coded track glyph types (gene models; splice alignments) and track flags (quality scores) displayed in the xGDBvm genome browser. Not shown: xGDBvm can also display unknown sequence or repeat-masked regions as a grey bar. See text for details.
xGDBvm-iPlant
We implemented xGDBvm as a VM image on iPlant’s Atmosphere cloud platform (https://atmo.iplantcollaborative.org/application), available to registered life sciences researchers (see http://www.iplantcollaborative.org/content/acceptable-use-policy). We further customized the VM taking advantage of iPlant’s data and job execution APIs, making xGDBvm a one-stop destination for genome annotation and display. Registered iPlant users can create and configure an xGDBvm instance via the Atmosphere control panel, and then access the xGDBvm instance via a Web browser to perform all subsequent tasks: validate inputs, run HPC jobs, initiate local workflows, check progress, and view/edit the resulting genome annotation. The genome browser(s) can be made public or private as desired. The following sections detail xGDBvm’s functionality in its current version on iPlant Atmosphere.
Inputs and data processing
Fig. 3 diagrams the modular architecture used by xGDBvm at iPlant. For managing inputs, xGDBvm uses iPlant’s Data Store cloud storage service (http://www.iplantcollaborative.org/ci/data-store), which provides high capacity storage and tools for quickly uploading user data files. During the xGDBvm configuration process, the user’s Data Store home directory is mounted to the VM’s file system using iRODS FUSE (http://irods.org) and files uploaded to the Data Store are thus accessible on the VM using Unix file system commands. For output data (alignment files, GFF3 files, sequence indexes, MySQL database tables, configuration files, and archives), the user can attach a block storage volume to the VM via the Atmosphere control panel, and mount it to the VM’s file system. This data partitioning strategy has the advantage that all data outputs are separate from the VM and do not consume its limited storage capacity while at the same time providing scalability as the data transfer for HPC jobs occurs directly with the data store. Moreover, the complete xGDBvm display can be reconstituted by mounting the volume to a new xGDBvm instance, useful in the event a VM becomes unavailable.
Figure 3. xGDBvm architecture. An xGDBvm instance, hosted on CyVerse’s Atmosphere cloud infrastructure (https://atmo.iplantcollaborative.org/application), has separate file system partitions under root (containing the xGDBvm Web GUI, scripts and binaries, and other software) and /home/ (which is configured with mount points for the user’s Data Store home directory for data input and a block storage volume for data output). The Agave API, hosted by the CyVerse Discovery Environment, is used for authentication of the VM via OAuth2 and for management of High Performance Computing applications and job submission. A key feature of xGDBvm is the ability to attach and mount the output volume to a different VM and reconstitute the annotation outputs and display. See text for details.
Managing files and ensuring validity of inputs (sanity checks) is a challenge for computational pipelines where multiple inputs of various types and formats may be used. xGDBvm makes use of filename standardization and extensive validation tools to reduce the incidence of input errors. Each input file is required to be named according to its data type and file format, e.g. ~est.fa for a FASTA file of EST sequences, where “~” is any user prefix, and all input files are placed in a single directory whose path is saved as a configuration variable. In addition, output files (including copies of input files) are all named according to the same conventions, with the GDB number as a prefix, e.g. GDB001est.fa, and deposited in subdirectories according to their type/process. Once an input path has been specified, xGDBvm displays valid filenames in the input directory according to type, displays predicted output tracks, and alerts to any missing files that would compromise output. The user then initiates a script to validate sequence deflines (description lines), error-check IDs and enumerate file contents either singly or in batch mode (see Supplemental Fig. 1). File validity metadata are stored along with a unique file stamp, so files need only be validated once unless modified.
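The filename convention described above amounts to a suffix lookup. The following Python sketch is illustrative only and is not taken from the xGDBvm code: the suffix table is hypothetical except for `est.fa`, the one example given in the text, and the function names `classify_input` and `output_name` are assumptions introduced here.

```python
from typing import Optional

# Hypothetical suffix table. Only "est.fa" appears in the text;
# the other entries are assumed for illustration.
INPUT_SUFFIXES = {
    "est.fa": "EST sequences (FASTA)",
    "cdna.fa": "cDNA sequences (FASTA)",     # assumed suffix
    "prot.fa": "protein sequences (FASTA)",  # assumed suffix
}


def classify_input(filename: str) -> Optional[str]:
    """Return the type suffix of an input file, or None if unrecognized."""
    # Try longer suffixes first so the most specific match wins.
    for suffix in sorted(INPUT_SUFFIXES, key=len, reverse=True):
        if filename.endswith(suffix):
            return suffix
    return None


def output_name(gdb_number: int, filename: str) -> str:
    """Replace the user prefix with the GDB identifier, e.g. GDB001est.fa."""
    suffix = classify_input(filename)
    if suffix is None:
        raise ValueError(f"unrecognized input filename: {filename}")
    return f"GDB{gdb_number:03d}{suffix}"
```

Under these assumptions, an input named `maize_est.fa` assigned to the first genome database would map to the output name `GDB001est.fa`, matching the example in the text.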
Supplemental Fig. 2 shows the complete, automated workflow for creating and updating a genome annotation. Typical inputs include a genome sequence assembly and a set of transcript sequences (EST, cDNA, or short read/transcript assembly [TSA]) and/or predicted protein sequences, in FASTA format. Depending on availability, transcripts may be from the same or a closely related species (Wang et al., 2008). Protein sequences should be from a well-characterized genome as close as possible taxonomically to the target species. With transcript (EST, cDNA or TSA) inputs, xGDBvm will compute spliced alignments, according to user-specified or default parameters, using the multithreaded GeneSeqer-MPI spliced alignment program (Usuka et al., 2000) installed locally or on an HPC server with up to 128 cores. For this step, the user can opt to apply repeat masking to the genome sequence using vmktree/vmatch (Abouelhoda et al., 2002) to reduce computation time, with inclusion of a suitable repeat mask sequence library. Alternatively, the user can provide an N-masked genome file as input. For related-species protein inputs, xGDBvm computes spliced alignments using the GenomeThreader program (Gremme et al., 2005) either locally or on an HPC server. Spliced alignments that meet a quality threshold are ultimately displayed in the xGDBvm genome browser as discrete tracks with standard box-line glyphs to indicate exon/intron boundaries (Fig. 2). The user can also provide GeneSeqer and/or GenomeThreader output files, created offline, as inputs, bypassing the above steps.
The xGDBvm workflow next uses spliced alignment data as input for CpGAT, which assembles gene model predictions for the genome. CpGAT uses EVM (EVidence Modeler; http://evidencemodeler.github.io) (Haas et al., 2008) to evaluate GeneSeqer transcript alignments and/or GenomeThreader protein spliced alignments, together with ab initio gene finder results from BGF (http://bgf.genomics.org.cn), GeneMark (http://exon.gatech.edu/GeneMark/) (Borodovsky et al., 2003), and Augustus (http://bioinf.uni-greifswald.de/augustus/) (Stanke et al., 2006), and derives an optimal set of transcript models which are then BLASTed against a reference protein dataset (if supplied by the user). In addition, some PASA (Haas et al., 2003) functions are used to aggregate splice variant models where indicated by evidence alignments. Optionally, the user can request repeat masking of the genome prior to ab initio gene prediction. The output from CpGAT is a set of BLAST-filtered or unfiltered gene model structures for each genome segment, complete with coordinates for start/stop codon and predicted UTRs where possible, in GFF3 format, which are loaded to the xGDBvm database. Several CpGAT parameters are user-configurable with the xGDBvm GUI, allowing the user to select a species model or bypass ab initio gene finders, relax reference protein BLAST filtering, or request repeat masking, and the complete set of CpGAT parameters can be modified by editing the CpGAT configuration file.
As a final step, xGDBvm calculates the GAEVAL score for each gene model, consisting of a set of statistics representing the degree of congruence of the model with available alignment evidence (http://plantgdb.org/GAEVAL/docs/index.html). GAEVAL also reports alternative splicing evidence and classifies annotation errors into discrete types such as gene fusion, gene fission, etc. GAEVAL data summaries are displayed in xGDBvm as a flag associated with each track glyph (Schlueter et al., 2005).
Users can also upload pre-computed genome annotations provided as GFF3 file(s) along with optional transcript and translation FASTA files. These data are displayed in the form of a separate annotation track, with GAEVAL scores calculated as described above. If gene descriptions are available in tabular form, these can also be uploaded to augment gene annotation tracks.
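For readers unfamiliar with GFF3, the format used for CpGAT output and for uploaded annotations, each feature occupies one line of nine tab-separated columns. The Python sketch below shows how such a line might be parsed; it illustrates the file format only and does not represent xGDBvm's own loader. The names `Gff3Feature` and `parse_gff3_line` are hypothetical.

```python
from typing import Dict, NamedTuple
from urllib.parse import unquote


class Gff3Feature(NamedTuple):
    """One GFF3 feature line (nine tab-separated columns)."""
    seqid: str
    source: str
    feature_type: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Gff3Feature:
    """Parse a single non-comment GFF3 line into a Gff3Feature."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(cols)}")
    start, end = int(cols[3]), int(cols[4])
    if start > end:
        raise ValueError("start coordinate must not exceed end coordinate")
    # Column 9 holds semicolon-separated tag=value pairs; values are URL-escaped.
    attributes = {}
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = unquote(value)
    return Gff3Feature(cols[0], cols[1], cols[2], start, end,
                       cols[5], cols[6], cols[7], attributes)
```

A gene model line such as `chr1	CpGAT	mRNA	100	900	.	+	.	ID=m1;Parent=g1` (an invented example) would yield a feature with `start` 100, `end` 900, and an `attributes` dictionary containing `ID` and `Parent`.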
xGDBvm setup, configuration and data processing
xGDBvm was designed to be easy to configure and run (Fig. 1). As a supplement to online help and video tutorials (see below), beginning users can consult the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php) which includes step-by-step instructions and information about how to choose the correct VM size and storage capacity for their particular genome annotation needs.
After instance creation, the user accesses the shell via a terminal emulator or Atmosphere’s built-in shell emulator and types a series of simple commands to configure and password-protect the VM environment. Subsequent steps are accomplished using a Web browser connecting to the VM via HTTPS, or by connecting to the VM using a virtual network computing (VNC) client (Atmosphere offers a built-in VNC window as well). xGDBvm’s hierarchical user interface is organized by task type: Manage, View, Annotate, and Help, with submenus under each section. Under Manage are Admin (manage site passwords, admin emails, and yrGATE users); Configure/Create (create or update a genome browser); and Remote Jobs (configure and manage remote HPC jobs; see next section). End-user oriented sections include View (browse/analyze genomes), and Annotate (submit/manage user annotations). Each section and subsection includes a Getting Started page that outlines the suggested workflow along with key links and one or more Help pages with detailed documentation including video tutorials that can be viewed on the VM. Contextual popup help dialogs are also provided for each page/step.
Under Manage → Configure/Create, a user can check volume capacity of the VM, manage license keys for certain installed software and then consult a decision tree to guide them to the correct data sources, a table of filename conventions, and a guide to CpGAT annotation. Once the data files are in place, the user clicks ‘Create New GDB’, selects a file path pointing to the data input files, enters any non-default parameters as well as genome metadata, and then saves the configuration setup, which is assigned ‘Development’ status and an ID (GDB001, etc.) that will be associated with the output database (Fig. 4A). The user can now click to validate file contents as described above. To initiate data processing, the user selects ‘Data Process Options’ followed by ‘Create GDB’, which changes status to ‘Locked’, initiates the central data processing workflow, and displays a running report of progress together with any errors. The workflow can be aborted at any time by clicking the ‘Abort’ button under ‘Data Process Options’; this removes all dynamically created directories and kills all associated processes, returning the configuration to ‘Development’ status. On successful workflow completion, GDB status is changed to ‘Current’ and the new genome is added to the View menu structure. Input datasets, annotation statistics, and output datasets can be viewed online. Output errors are logged and displayed to the user along with context-specific help dialogs (Supplemental Fig. 3).
Figure 4. xGDBvm data management. A. Screenshot of the GDB Configuration GUI, set up for processing Example data. Each genome annotation is assigned a unique identifier (GDB001, GDB002, etc.) and a user-provided name. In addition to form fields for input data path, annotation parameters, and metadata, this page provides extensive color-coded information about all system settings (e.g. license keys, storage capacity, login status, displayed in blue-green), input data validity (light green), and expected output (orange). The GUI includes buttons that launch modal windows to initiate computational workflow or edit configuration. B. Screenshot of Archive/Delete GUI, showing genome databases with ‘Current’ (blue; computation complete) or ‘Development’ (grey; not yet run) status. Each table row displays information about a GDB including time stamps as well as action buttons that allow the user to Drop, Delete, Archive, Delete Archive, or Copy database (see text for details). Global action buttons (top right) allow the user to Delete or Archive all data on the VM. C. Screenshot of List All Jobs GUI with tools to monitor and manage remote HPC jobs. The GUI displays IDs, job metadata, time stamps, color-coded status indicators and action buttons to manage output (Stop Job, Delete Job, View Logs, Copy Output) via the Agave API; see text for details.
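The GDB status changes described in the text (‘Development’ to ‘Locked’ on ‘Create GDB’, back to ‘Development’ on ‘Abort’, and on to ‘Current’ on successful completion) can be summarized as a small state machine. The Python sketch below is a reading aid under stated assumptions, not xGDBvm's implementation: the event names `create`, `abort`, and `complete` are hypothetical labels for the GUI actions, and only the three transitions named above are covered.

```python
from enum import Enum


class Status(Enum):
    DEVELOPMENT = "Development"
    LOCKED = "Locked"
    CURRENT = "Current"


# Transitions named in the text; the event labels are hypothetical.
TRANSITIONS = {
    (Status.DEVELOPMENT, "create"): Status.LOCKED,   # 'Create GDB' clicked
    (Status.LOCKED, "abort"): Status.DEVELOPMENT,    # 'Abort' clicked
    (Status.LOCKED, "complete"): Status.CURRENT,     # workflow finished
}


def next_status(current: Status, event: str) -> Status:
    """Return the status that follows `event`, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in status {current.value}"
        ) from None
```

Encoding the transitions as a lookup table makes it easy to see that, in this simplified model, a workflow can only be aborted or completed while the configuration is ‘Locked’.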
Any of several lightweight, preconfigured sample datasets (Supplemental Fig. 4) can be loaded with a single button click from the ‘Create New’ page and then saved and processed to a finished GDB in no more than a few minutes. Because these examples cover the complete range of processes and workflows in the xGDBvm code, they also serve as functional tests when first setting up an xGDBvm instance or modifying its code.
High-performance computing option
On multi-processor VMs, xGDBvm automatically invokes parallel processing where possible for certain computational steps (see Supplemental Fig. 1). This can speed up spliced alignment and genome annotation (CpGAT) jobs, in that more than one genome segment can be evaluated concurrently on separate processor threads. As an alternative for even more processing power, xGDBvm is capable of sending input data for spliced alignment jobs to high-performance computing facilities, either as a standalone job or as part of an annotation workflow. For this option, the user’s input data must be on a VM-mounted iPlant Data Store directory and assigned to a GDB with ‘Development’ status. GeneSeqer-MPI and GenomeThreader binaries, along with wrapper scripts for job submission to an HPC server, are installed in iPlant’s Discovery Environment (https://de.iplantcollaborative.org/de/) as executable ‘apps’. Client access to HPC resources and apps is managed via the Agave API (Dooley et al., 2012), http://agaveapi.co, which provides an open-source platform for interacting with computational resources that are managed under the XSEDE system (https://www.xsede.org/). xGDBvm uses Agave’s implementation of the OAuth2 (http://oauth.net) standard for authorization and subsequent authentication to use apps.
Under Manage → Remote Jobs, users first submit their iPlant username/password in return for OAuth2 credentials that are stored securely on the VM and allow access to remote applications (GeneSeqer-MPI and GenomeThreader). The user can then log in and obtain a temporary access token and refresh token for authentication. The VM-cached refresh token is also used by local scripts to re-authenticate API access during automated workflow processing. The user can select the app size (i.e. number of processors) for optimal efficiency given their genome size and complexity and then return to the GDB Configuration page, select the ‘remote’ option for spliced alignment, and initiate the automated workflow. The xGDBvm workflow script copies relevant input data (genome, transcript and/or protein) to a temporary directory on the user’s mounted Data Store directory and issues a job submission command via cURL (https://curl.haxx.se) to a custom wrapper script (see Fig. 3). The wrapper script accepts parameters, splits and indexes input files as appropriate for multiple processors, and then issues a command to launch GeneSeqer-MPI or GenomeThreader on the specified HPC server cluster. The xGDBvm workflow updates remote job status periodically using a callback URL to xGDBvm and/or email notification service. Output data are copied to a specified subdirectory on the user’s Data Store directory, where xGDBvm’s workflow can access them for further processing. Remote job details and status are tracked by xGDBvm, and users can access job lists, query remote job status, and kill a remote job using the Manage → Remote Jobs GUI (Fig. 4C).
Remote GeneSeqer or GenomeThreader spliced alignment jobs can also be run as a standalone process via Manage → Remote Jobs. Output is archived on the user’s Data Store directory, and xGDBvm can be directed to evaluate the output and copy output files to an input directory for inclusion in workflow processing.
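The text states that the wrapper script splits input files as appropriate for multiple processors but does not describe how. As one plausible illustration, the Python sketch below distributes FASTA records across a fixed number of chunks so that total sequence length per chunk is roughly balanced. The greedy longest-first strategy and the names `read_fasta` and `split_records` are assumptions introduced here, not the wrapper's documented behavior.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (defline without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_records(records: Iterable[Record], n_chunks: int) -> List[List[Record]]:
    """Assign each record, longest first, to the currently smallest chunk."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[Record]] = [[] for _ in range(n_chunks)]
    sizes = [0] * n_chunks
    for rec in sorted(records, key=lambda r: len(r[1]), reverse=True):
        i = sizes.index(min(sizes))  # index of the least-loaded chunk
        chunks[i].append(rec)
        sizes[i] += len(rec[1])
    return chunks
```

Balancing by sequence length rather than by record count is a common choice when alignment time grows with input size; whether the actual wrapper scripts balance in this way is not stated in the text.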
Logging / troubleshooting
Each step in xGDBvm’s computational workflow script (see Supplemental Fig. 2) is displayed dynamically during automated workflow operation and saved in a process log. Common errors (e.g. mismatch in data input/output, incorrect format, duplicate IDs) are flagged and logged in an error file, along with user hints to remedy the problem (see Supplemental Fig. 3). A separate file is created for logging CpGAT progress.
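As an illustration of one of the checks named above (duplicate IDs), the following Python sketch scans FASTA headers and reports any identifier that occurs more than once. It is a hedged example rather than xGDBvm's own validation code; the function name `find_duplicate_ids` and the convention that the ID is the first whitespace-delimited token of the header are assumptions.

```python
from collections import Counter
from typing import Iterable, List


def find_duplicate_ids(fasta_lines: Iterable[str]) -> List[str]:
    """Return, in sorted order, sequence IDs that appear in more than one header."""
    counts: Counter = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumed convention: the ID is the first token after '>'.
            seq_id = fields[0] if fields else ""
            counts[seq_id] += 1
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


if __name__ == "__main__":
    records = [">seq1 first copy", "ACGT", ">seq2", "GGCC", ">seq1 second copy", "TTAA"]
    for dup in find_duplicate_ids(records):
        # A real workflow would write this message and a user hint to an error log.
        print(f"ERROR: duplicate sequence ID '{dup}'")
```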
Outputs and data analysis tools
xGDBvm displays the output of workflow processing as schematized glyphs, organized into color-coded tracks, in a full-featured genome browser (Fig. 5). Standard tracks include EST, cDNA, TSA (transcript sequence assembly), and protein spliced alignments; pre-computed and CpGAT gene predictions; and regions that have been repeat masked or assigned as spacer regions (N-substituted). Additional user-generated tracks include yrGATE annotations and region-specific CpGAT annotations. Advanced users can create unlimited additional tracks by manually populating new data tables and modifying configuration files. The xGDBvm genome browser has track features similar to those currently available at http://plantgdb.org (zoom/scroll; show/hide or reorder tracks; change font size; view base pair level). The genome browser also includes a suite of analysis tools including search and retrieve for sequence or subsequence regions (introns, exons, up/downstream regions); NCBI-BLAST for sequence queries within or across genomes; region-specific GenomeThreader and CpGAT tools; and the ability to add a custom track from a local GFF file. Complementing the Genome Context View are searchable, tabular views for each Feature Track type ordered by genome position. The Gene Models table displays annotated loci along with structural metadata, similarity descriptions, GAEVAL gene quality/coverage, and yrGATE annotation status (see below). The Aligned Proteins and Aligned Transcripts tables display splice-aligned sequences of each type with filters for alignment quality/coverage and links to alignment details. A separate page for GAEVAL Scores displays comprehensive gene quality data based on comparison of gene predictions with alignment evidence and offers multiple search filters.

All inputs, outputs, and archives (see below) are stored hierarchically under /xGDBvm/data/GDBnnn/data/, and they are also available for download to local storage using the VM’s GUI (View → GDBnnn → Data Download). Using this download service, the user could for example retrieve GFF-formatted annotation outputs from CpGAT for use in further analysis or display on a different genome browser. Data files can also be copied to the Data Store either manually or by creating and copying a GDB Archive (see below).
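For readers who download GFF-formatted annotation outputs for further analysis, a short Python sketch of reading gene features from GFF3 text may be helpful. It assumes standard nine-column, tab-delimited GFF3 with an `ID` attribute; the function name `parse_gff3_genes` and the example record are hypothetical and are not taken from xGDBvm output.

```python
from typing import Dict, Iterable, List


def parse_gff3_genes(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Extract gene features (seqid, start, end, strand, ID) from GFF3 lines."""
    genes: List[Dict[str, object]] = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # keep only well-formed 'gene' rows
        # Column 9 holds semicolon-separated key=value attributes.
        attributes = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        genes.append({
            "seqid": cols[0],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "id": attributes.get("ID", ""),
        })
    return genes


if __name__ == "__main__":
    example = [
        "##gff-version 3",
        "scaffold_1\tCpGAT\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=demo",
    ]
    print(parse_gff3_genes(example))
```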
Figure 5. Genome context view. Shown is a typical region from the Capsella rubella genome
annotation described under Results. Genome span is shown in yellow, and genome features
(tracks) are as labeled to the left and above each track, and drag-and-drop reorder and “hide
track” features are implemented here. Top bar provides search and navigation controls; left bar
contains links to tools and views, as well as to configuration and help pages. Region submenu
(orange) contains zoom/scroll, region-specific tools and formatting controls. See Table 1 for
details of xGDBvm tools and features.
Updating or adding tracks
In cases where the user may wish to append or replace data, xGDBvm includes an Update branch to the data workflow allowing any track to be appended or replaced. The user sets an ‘Update’ flag on the configuration page, specifies a directory where update data reside, and selects the data type(s) and update action(s) desired. The user then clicks ‘Update’, which adds or replaces data inputs and re-runs appropriate scripts to update the genome data tables, indices and display. All update actions are logged in the same way as for a new GDB and are appended to the same process log.

The xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/) includes complete instructions for adding additional annotation or alignment tracks beyond the five standard tracks available. Users familiar with MySQL and the necessary computational steps can completely customize an instance of xGDBvm, using pre-computed data as inputs.
Managing xGDBvm datasets
Output datasets can be managed on the Manage → Config/Create → Archive/Delete page (Fig. 4B). For archiving a GDB, the entire output directory tree is compressed as a tar archive and stored in an Archive directory under /xGDBvm/data/ArchiveGDB/, and the archive can be copied to the user’s Data Store with a single button click. If the corresponding GDB is later dropped (see below) or becomes corrupted, the archive can be readily restored using the ‘Restore from Archive’ button. GDB archives also facilitate sharing data with other researchers, who can use the ‘Restore from Archive’ function to load any archive to their own VM. In addition, all GDBs can be archived together using the ‘Archive All’ function. Any ‘Current’ xGDBvm database can be discarded using the ‘Drop’ button. This removes all GDB-associated directories and their output data, but preserves the GDB ID and its stored configuration data, allowing users to build on the previous configuration or restore (see above) a GDB. Finally, the most recently added GDB can be deleted using ‘Delete’, or all GDBs can be deleted using ‘Delete All’.
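The archive and restore operations described above can be sketched with Python's standard `tarfile` module. This is an illustrative approximation, not xGDBvm's implementation: the function names, the use of gzip compression, and the archive file naming are all assumptions.

```python
import tarfile
from pathlib import Path


def archive_gdb(gdb_dir: Path, archive_dir: Path) -> Path:
    """Compress one GDB output directory tree into archive_dir (hypothetical layout)."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    archive_path = archive_dir / f"{gdb_dir.name}.tar.gz"  # naming is an assumption
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the GDB directory name.
        tar.add(gdb_dir, arcname=gdb_dir.name)
    return archive_path


def restore_gdb(archive_path: Path, data_dir: Path) -> None:
    """Unpack an archive under data_dir; use only with archives from a trusted source."""
    data_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(data_dir)
```

Because the archive stores paths relative to the GDB directory name, the same file can be unpacked on a different VM, which mirrors the data-sharing use described above.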
Reannotating with yrGATE
A key feature of xGDBvm is the ability to flag low-quality gene structures and improve them in-place by manual re-annotation. For each genome displayed on xGDBvm, the Gene Models page provides filters to select high coverage / low integrity models (based on GAEVAL quality score and coverage) that might be improved by manual inspection (Fig. 6A). Users can create an annotation login account and correct, confirm, or disqualify any gene prediction using the yrGATE annotation tool (Wilkerson et al., 2006); see Fig. 6B. The yrGATE tool offers point-and-click simplicity for building a gene structure, enhanced by dynamic reporting of GAEVAL scores to guide the user to the best possible model based on evidence alignments. yrGATE includes curation tools for users who are assigned Administrator status, providing a quality check for submitted annotations prior to their display. All re-annotation and curation steps are carried out in a single browser window with portals to NCBI BLAST and other analysis tools, and users can manage their own annotations (save, submit for curation, delete) on the Community Central pages. Administrative features include the ability to assign users to annotation working groups, track annotation totals for each user, and configure one or more email addresses for administrative notification. Once curated, yrGATE annotations are displayed as a separate track in the xGDBvm genome browser with color-coding to indicate re-annotation class (Fig. 6B), and these can be downloaded in GFF3 or FASTA format.

Figure 6. Gene model improvement using yrGATE. A. A published gene model from Capsella rubella (Carubv1011418m.g) showing high coverage/low integrity in the Locus Table (upper table, highlighted columns). B. Corresponding gene model in genome context view (blue glyph). CpGAT annotated this region as two distinct loci (magenta glyph), backed up by both Arabidopsis protein (black) and cDNA (light blue). The region was then re-annotated using yrGATE (dark and light green glyphs) to confirm the most probable genic structure of this region based on available evidence. yrGATE glyphs are color-coded according to the type assigned by the annotator, e.g. dark green (improved structure); light green (new structure not previously annotated).
Benchmarking xGDBvm
Whole genome annotation. Capsella rubella is an Arabidopsis relative with a sequenced genome totaling 134.8 Mb (Slotte et al., 2013). We evaluated xGDBvm as a tool for new genome annotation using the Capsella rubella genome assembly (see Methods for sequence sources and parameters). We obtained both Arabidopsis thaliana cDNA sequences and A. thaliana predicted proteins as input for evidence alignments. We first computed high-quality transcript and protein spliced alignments using the ‘standalone’ HPC job submission tool in an xGDBvm instance at iPlant. The GeneSeqer-MPI job (8 processors with 64 threads) and GenomeThreader job (2 processors with 12 threads) finished in 7 hr and 1 hr, respectively. These outputs were used as input for an annotation workflow (with CpGAT option selected) in xGDBvm. The CpGAT reference dataset was the entire set of UniRef90 Viridiplantae proteins (see Methods). In addition, the C. rubella annotation dataset (in GFF3 format) was uploaded to xGDBvm for comparison. The annotation of 873 scaffolds was completed in approx. 12 days on a single core processor VM with 4 GB RAM. The results are shown in Table 2. xGDBvm completed 49,947 cDNA spliced alignments and 28,595 protein spliced alignments. The CpGAT annotation generated 25,498 gene models, compared to 28,447 gene models from the published C. rubella annotation. A total of 4,368 loci from the published annotation had no match in the CpGAT set (as determined by overlap), while 861 loci were unique to CpGAT. Comparison of 19,892 loci with gene models from both CpGAT and the published annotation using ParsEval (Standage and Brendel, 2012) revealed a high level of congruence between the two data sets. More than 60% of the gene models compared had identical coding sequences. At the level of individual exons, the sensitivity (true positive rate) was 69% and the specificity (true negative rate) was 68%, or 89% and 88% respectively if restricted to coding exons. At the level of individual nucleotides, the sensitivity and specificity were 97% and 96%, respectively. These data demonstrate the reliability of CpGAT as a workflow for producing a provisional genome annotation (our purpose is not to present a detailed comparison of these two annotations; the respective evidence alignment datasets and thresholds were likely not identical, making such detailed analysis complex).
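The notion of loci having "no match ... as determined by overlap" can be made concrete with a small Python sketch. It is a simplified illustration and not the procedure used to produce the numbers above: the function name `loci_without_overlap`, the coordinate convention (1-based, inclusive), and the decision to ignore strand are assumptions.

```python
from typing import Dict, List, Tuple

Locus = Tuple[str, int, int]  # (scaffold, start, end), 1-based inclusive


def loci_without_overlap(query: List[Locus], reference: List[Locus]) -> List[Locus]:
    """Return query loci that overlap no reference locus on the same scaffold."""
    by_scaffold: Dict[str, List[Tuple[int, int]]] = {}
    for scaffold, start, end in reference:
        by_scaffold.setdefault(scaffold, []).append((start, end))
    unmatched: List[Locus] = []
    for scaffold, start, end in query:
        candidates = by_scaffold.get(scaffold, [])
        # Two closed intervals overlap when each starts before the other ends.
        if not any(start <= r_end and r_start <= end for r_start, r_end in candidates):
            unmatched.append((scaffold, start, end))
    return unmatched


if __name__ == "__main__":
    published = [("s1", 100, 500), ("s1", 1000, 1500)]
    predicted = [("s1", 450, 700)]
    print(loci_without_overlap(published, predicted))
```

The pairwise scan is quadratic per scaffold, which is adequate for a sketch; a sorted sweep or interval tree would be preferable at genome scale.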
Re-annotation of low quality predictions. We evaluated GAEVAL gene quality for the Capsella rubella annotation dataset on a locus basis by setting a locus table filter for average integrity < 75% and coverage > 75%. This filter resulted in 254 questionable loci with likely annotation errors for CpGAT models compared to 558 questionable models in the published annotation set (Table 2). This subset represents models for which re-annotation has a high probability of improving gene prediction via the yrGATE tool. We chose an example of a locus from the published annotation that was flagged by GAEVAL as possibly erroneous, Carubv1011418m.g (Fig. 6). The CpGAT annotation for this region was split into two distinct, complete gene structures, identified as scaffold_1.g5.t1 and scaffold_1.g6.t1. Using the yrGATE tool, we confirmed the CpGAT models as more accurately representing the evidence alignments (dark and light green tracks in Fig. 6B).
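The locus table filter described above (average integrity < 75% and coverage > 75%) amounts to a simple threshold test, sketched here in Python. The record layout, the field names `integrity` and `coverage`, and the use of fractions rather than percentages are assumptions for illustration; GAEVAL's own output format is not reproduced.

```python
from typing import Any, Dict, List


def flag_questionable(
    loci: List[Dict[str, Any]],
    max_integrity: float = 0.75,
    min_coverage: float = 0.75,
) -> List[Dict[str, Any]]:
    """Select loci with low integrity despite high evidence coverage."""
    return [
        locus
        for locus in loci
        if locus["integrity"] < max_integrity and locus["coverage"] > min_coverage
    ]


if __name__ == "__main__":
    table = [
        {"id": "locusA", "integrity": 0.60, "coverage": 0.90},  # flagged
        {"id": "locusB", "integrity": 0.80, "coverage": 0.95},  # integrity too high
    ]
    for locus in flag_questionable(table):
        print(locus["id"])
```

Loci passing this filter are well covered by evidence alignments yet poorly supported in structure, which is why they are good candidates for manual re-annotation.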
Genome region. Another use for xGDBvm is to annotate a genome segment containing a specific gene or region of interest. This would typically be a rapid turnaround analysis compared to whole genome analysis and thus could be carried out using internal computing resources, possibly repeatedly under different parameter regimes. As an example, we used a Setaria italica predicted protein, annotated as ‘stem-specific protein TSJT1-like’, as a tBLASTn query against the Musa acuminata subsp. malaccensis whole genome sequence data in GenBank. We retrieved a contig (839) that contained a region of high similarity to this sequence (see Methods). We then configured xGDBvm inputs consisting of Musa genomic contig 839, the current Musa acuminata EST dataset from GenBank, and the predicted protein translations from the annotated genome of a related monocotyledonous plant species, Brachypodium distachyon (http://www.brachypodium.org). The workflow included gene prediction using CpGAT with UniRef90 proteins from Viridiplantae as a reference dataset (see Methods). The CpGAT output included 4 evidence-based loci and 12 ab initio predicted genes, including a model fully supported by transcript alignment in the region with high similarity to XP_004977556 (Supplemental Fig. 5).
xGDBvm implementation
iPlant. xGDBvm has been deployed as a public image on iPlant’s Atmosphere Cloud Service (https://atmo.iplantcollaborative.org/application). Researchers can launch an xGDBvm instance and explore it once they have obtained an iPlant user account (https://user.iplantcollaborative.org/register/) using an institutional email address. An iPlant account also grants the user a home page on iPlant’s Data Store. Step-by-step instructions for setting up xGDBvm, available on the Wiki at http://goblinx.soic.indiana.edu/wiki/doku.php?id=user_instructions, can be summarized as follows: 1) In the Atmosphere Control Panel, find the latest xGDBvm image, launch an instance, and attach an external block storage volume using drag-and-drop; 2) Access the instance’s secure shell using iPlant credentials and type simple commands to update xGDBvm code, set a Web password, initialize iRODS/FUSE, mount external storage, and launch a configuration script; 3) Access the VM’s GUI via HTTPS or VPN and follow instructions there to configure/create a genome annotation.

Indiana University. xGDBvm has also been implemented on a ‘production’ virtual server at Indiana University, serving as a host for PdomGDB (http://goblinx.soic.indiana.edu/PdomGDB), a genome database for Polistes dominula (European paper wasp), as well as the test datasets described here (see Data Access). PdomGDB provides a showcase for the xGDBvm platform, including the addition of extra nonstandard feature tracks created using methods outlined in the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php?id=configure_new_track). PdomGDB is actively being updated by the Polistes research community using the yrGATE tool for contributing expert-curated gene annotations, as described in this manuscript (accepted submissions are accessible at http://goblinx.soic.indiana.edu/yrGATE/GDB001/CommunityCentral.pl). This website also includes general information on the xGDBvm project on the project home page (http://goblinx.soic.indiana.edu/index.php).

Public repository. The xGDBvm project maintains a presence at http://brendelgroup.github.io/xGDBvm/. The xGDBvm-specific software can be accessed and updated from https://github.com/BrendelGroup/xGDBvm, where developers can contribute via git pull requests, and users can screen pending issues and report new ones. xGDBvm is licensed under the GNU General Public License, version 3. The repository includes case studies that illustrate real-world projects implemented using xGDBvm (https://github.com/BrendelGroup/xGDBvm/tree/master/case-studies/).
DISCUSSION
xGDBvm’s utility
As an all-in-one solution to genome annotation and analysis, xGDBvm is unique among currently available packages. Configured as a virtual server with a complete GUI interface and HPC capabilities, xGDBvm removes barriers to entry imposed by extensive software installation, testing and troubleshooting, and command-line operation. The xGDBvm GUI guides inexperienced users by presenting only actionable choices and instructions at each step, as well as providing pre-installed sample datasets, input data validation, error flagging, and extensive help pop-ups. Data management is handled entirely within the xGDBvm environment, allowing the user to focus on the overall annotation task rather than managing intermediate input/output files. The resulting Web site can be either public or password-protected as desired, and the contents can be archived, shared, or exported for display using other genome display platforms. We expect that this combination of features will make xGDBvm attractive to research groups with a desire to annotate genome data but limited access to informatics support.

There are several use cases for xGDBvm in its current implementation at iPlant:
1) Researchers with a newly assembled genome who can quickly align relevant transcript assembly and/or protein data to determine probable gene location and then perform gene structure computation on either a portion of the genome or the genome in its entirety, resulting in a “first pass” genome annotation.
2) Researchers with a recently annotated genome who wish to share it and improve annotation quality via community annotation.
3) Researchers who wish to create their own copy of a ‘finished’ genome annotation in order to run gene quality analyses with up-to-date transcript data, and/or carry out targeted or general re-annotation.
4) Instructors desiring a hands-on environment for exploring the principles of genome annotation with real data and access to HPC resources.

In scope, xGDBvm provides an easy-to-use and versatile platform for annotating and analyzing genomes at various stages of completion. At one extreme, a finished genome can be loaded from data files available online, giving the user complete freedom to analyze and re-annotate genes previously published. At the opposite extreme, a newly assembled genome can be loaded together with related-species data and/or short read assemblies, and CpGAT can be invoked to automatically build a credible draft genome annotation for further analysis. With any implementation, the powerful built-in tools for gene quality analysis and re-annotation make xGDBvm a valuable asset for improving genome structure annotation as well.

Another advantage of xGDBvm is its flexibility, as it allows multiple genome views to be created in one instance and supports updates to any type of existing data. Finally, xGDBvm provides extensive documentation of the annotation and update process, important both for troubleshooting and for reporting results.
Comparison to similar tools
Other cloud-based annotation tools are available: Maker (http://www.yandell-lab.org/software/maker.html) is a eukaryotic genome annotation pipeline that can be installed in a variety of server environments (Cantarel et al., 2008), and a version of Maker (Maker-P) is installed at iPlant Atmosphere as a virtual machine with links to HPC (https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+at+iPlant). The web-based genome analysis platform Galaxy (Goecks et al., 2010) offers cloud installation via Amazon’s Elastic Compute Cloud (EC2) service (https://aws.amazon.com/ec2/). xGDBvm differs from these tools in that it offers a comprehensive package combining a structured environment for data inputs, automated data processing with sanity checks, and tools for genome display, search and re-annotation built in.
Limitations
As currently configured, xGDBvm is unable to map short read data onto a genome, so users will need to assemble short reads de novo, prior to submitting data to xGDBvm as a TSA dataset. xGDBvm’s computational workflow can currently accommodate only one track per spliced alignment data type (EST, cDNA, TSA, Protein), and two tracks for gene model predictions. Users who require additional tracks must configure them manually. xGDBvm’s HPC processes are currently limited to spliced alignment computations, whereas gene structure annotation via CpGAT is limited by the processing power of the VM.

VM availability and usage at iPlant, as well as access to HPC resources, can be expected to be limited based on overall capacity and the amount of demand on the respective systems. Users wishing to increase their usage quotas may be required to justify their request.
Future directions
xGDBvm is still being developed and improved. The road map includes additional features such as modular data workflows allowing unlimited track numbers, and additional options for gene annotation and evaluation. xGDBvm’s implementation of the Agave API should facilitate the addition of new standalone or pipeline-integrated computation tools that can take advantage of high performance processing (e.g. Maker). We also envision integrating xGDBvm with other analysis platforms including one that allows visualization of common introns (Wilkerson et al., 2006).
METHODS
xGDBvm architecture and software
The xGDBvm architecture is shown in Fig. 3, and a more detailed description can be found in the wiki (http://goblinx.soic.indiana.edu/wiki/). We currently maintain two parallel implementations of xGDBvm, one at Indiana University (xGDBvm-GoblinX) on a virtual server using Red Hat Enterprise Linux (http://www.redhat.com), and the other on the iPlant Atmosphere platform (xGDBvm-iPlant) using CentOS Linux (https://www.centos.org). Both implementations run Apache web server (http://www.apache.org) with very similar configurations, but xGDBvm-iPlant also includes OpenSSL (https://www.openssl.org) and Apache’s mod_ssl for secure access over HTTPS. Additional software includes MySQL client/server software (https://www.mysql.com), Perl (http://www.perl.org) and PHP (http://php.net/) to handle web scripts and some server-side functions, with additional Perl modules for CGI and session management. Installed JavaScript libraries include jQuery and jQuery UI (https://jquery.com). BioPerl (http://www.bioperl.org/wiki/Main_Page) and EMBOSS (http://emboss.sourceforge.net) were installed to handle certain operations. Additional binaries, including NCBI-BLAST+ (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/+) as well as the computation-related software described earlier, were installed under /usr/local/bin/ or /usr/local/src/ (see Supplemental Table 1 for a complete list of installed binaries).

The document root directory is /xGDBvm/ under the VM’s root partition. xGDB scripts (modified from Schlueter et al. (2006)), PHP scripts, and other assets (JavaScript files, CSS files and images) were installed under /xGDBvm/XGDB/, and administrative scripts under /xGDBvm/admin/. Workflow-related shell scripts are found under /xGDBvm/scripts/, and custom yrGATE, GAEVAL and CpGAT packages were installed under /xGDBvm/src/. The entire document root contents (excluding binaries) is maintained as a public repository at GitHub (https://github.com/BrendelGroup/xGDBvm).
The xGDBvm architecture is designed to segregate input data, dynamically generated output data, and static web scripts that comprise the xGDBvm core (see Fig. 3). The user’s Data Store directory (for inputs, segregated under a common subdirectory xgdbvm/) and block storage volume (for outputs) are mounted under /home/xgdb-input/ and /home/xgdb-data/, respectively. These are symbolically linked to paths under the document root (/xGDBvm/input and /xGDBvm/data), and all xGDBvm scripts reference these data paths for reading and writing data. Data destination directories are assigned ownership by group ‘xgdb’ with read-write privileges, and the ‘apache’ user is added to the ‘xgdb’ group under /etc/group. Temporary data are saved to /xGDBvm/data/tmp.

To provide secure transactions where passwords are being sent over the Web, xGDBvm-iPlant enforces HTTPS (with a self-signed certificate) on all pages. Website password protection via .htaccess is required upon initial configuration, so only users who have the password can view the website online. Password protection can also be modified using the xGDBvm Admin GUI to include just the Manage functions (Admin, Configure/Create and Remote Jobs); in this configuration, the VM’s genome browsers and data download sections are public. The back-end MySQL password can also be customized via the GUI for additional site security. Web access to the mounted storage directories is blocked by the Apache configuration, so the user’s mounted disks are not exposed on the Internet. Certain VM assets (OAuth2 credentials, MySQL password) are stored under /xGDBvm/admin/, which is protected via the Apache configuration.
Benchmarking xGDBvm
The hardmasked Capsella rubella assembly (Slotte et al., 2013) was downloaded from JGI (ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Crubella/assembly/Crubella_183_hardmasked.fa.gz; user account required). Arabidopsis thaliana cDNA FASTA sequences were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/nuccore?term=("mrna not est"[Filter]) AND Arabidopsis thaliana[Organism]). Predicted protein translations were obtained from the Arabidopsis TAIR10 genome release (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/). UniRef90 proteins from Viridiplantae were retrieved in FASTA format from Uniprot (http://www.uniprot.org/uniref/?query=uniprot:(taxonomy:”Viridiplantae+[33090]”)+identity:0.9) and the file renamed as UniRef90-Viridiplantae.fa. A genome annotation based on these input data was created on an xGDBvm instance at iPlant with 2 CPUs and 4 GB RAM (https://atmo.iplantcollaborative.org/application). xGDBvm’s GeneSeqer parameters were species model:Arabidopsis, alignment stringency:strict. CpGAT parameters were BGF:Arabidopsis, Augustus:arabidopsis, GeneMark:a_thaliana; Skip Mask=T. For comparison, the current C. rubella annotation (GFF3) was downloaded (ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Crubella/annotation/Crubella_183_gene.gff3.gz) and included as input in the genome workflow. Additional spliced alignment benchmarking and case studies used GeneSeqer-MPI and GenomeThreader running on high performance computing systems at Texas Advanced Computing Center (TACC; https://www.tacc.utexas.edu), accessed from xGDBvm as public apps via the Agave API.

For the second use case, we queried the NCBI whole genome shotgun sequence (wgs) library for Musa acuminata subsp. malaccensis (banana; http://www.ncbi.nlm.nih.gov/assembly/GCF_000313855.1/) using tblastn (http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=tblastn) with a Setaria italica predicted protein (XP_004977556.1). Musa acuminata contig 839 (GenBank accession CAIC01023586.1) was retrieved from NCBI (http://www.ncbi.nlm.nih.gov/Traces/wgs/fdump.cgi?CAIC01,23586); the resulting file was named Musa_contig_839.gdna.fa, and the FASTA header was simplified to “>Musa_contig839”. Musa acuminata EST sequences in FASTA format were retrieved from NCBI (http://www.ncbi.nlm.nih.gov/nucest?term=Musa_acuminata%5BOrganism%5D]) and renamed as musa_est.fa. UniRef90 proteins from Viridiplantae were retrieved in FASTA format as described above. xGDBvm’s GeneSeqer parameters were species model:rice, alignment stringency:strict. CpGAT parameters were BGF:rice, Augustus:maize, GeneMark:o_sativa; Skip Mask=T.
Accession Numbers
Datasets described under Benchmarking can be viewed and downloaded from the xGDBvm project pages at http://goblinx.soic.indiana.edu/GDB002/ (Capsella rubella genome) and http://goblinx.soic.indiana.edu/GDB003/ (Musa acuminata contig 839). A list of all Web resources referenced in this manuscript is found in Supplemental Table 2.
SUPPLEMENTAL DATA
Supplemental Figure 1. Input Data Validation.
Supplemental Figure 2. The xGDBvm automated workflow.
Supplemental Figure 3. Output data validation.
Supplemental Figure 4. Preconfigured example datasets.
Supplemental Figure 5. Annotation of a single genomic contig.
Supplemental Table 1. xGDBvm Installed Software.
Supplemental Table 2. Hyperlinks referenced in the manuscript.
ACKNOWLEDGMENTS
We thank Ann Fu for help with initial development of the automated workflow, Shannon Schlueter for advice in adapting his XGDB core code for the virtual environment, James Denton for extensive debugging and yrGATE feature development, Jianqing Guan for code to calculate dynamic GAEVAL scores, and Bruce Shei for system support at Indiana University. We especially thank collaborators and colleagues at the iPlant Collaborative (CyVerse) and Texas Advanced Computing Center (TACC) for their assistance in integrating xGDBvm into the Atmosphere cloud environment and the Agave API: Roger Barthelson and Shabari Subramaniam, who wrote and tested HPC wrapper scripts for GeneSeqer-MPI and GenomeThreader, respectively; Andre Mercer, who provided prototype PHP scripts for the API; and Edwin Skidmore, Rion Dooley, and Matthew Vaughn, who provided system troubleshooting and advice. This work was supported by NSF award #1221984 to V. Brendel.
AUTHOR CONTRIBUTIONS
704
V.B. conceived the project and provided overall guidance; J.D carried out the project and
705
managed collaborations; D.S. tested xGDBvm functionality with actual datasets,
706
configured and extended a production xGDBvm server, ran ParsEval comparisons, and
707
contributed
708
implementation at iPlant and created the prototype HPC wrapper scripts.
709
FIGURE LEGENDS
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
Figure 1. Overview of xGDBvm as implemented at CyVerse (iPlant).
xGDBvm is a virtual server environment for gene structure annotation that can be cloned,
configured, populated with input data, and run from a Web browser in a few steps, as
summarized here: A. Log in to the CyVerse Atmosphere Control Panel
(https://atmo.iplantcollaborative.org/application) (1) and click to create a new instance
(cloned copy) of xGDBvm (2), create a block storage volume,for output data, and attach
it to the instance (3). Open a Web shell interface (4), accessible from the Control Panel,
and type a series of commands to set up and configure the new xGDBvm instance, also
mounting the Data Store and the attached volume. B. Log in to the CyVerse Data Store
cloud storage system (https://de.iplantcollaborative.org/de/) and upload input data files to
an input data directory (accessible to the VM) using a batch uploading tool. Naming
conventions are used to identify each input type. C. Log in to the xGDBvm instance’s
Graphical User Interface (GUI) using HTTPS via its unique IP address or using a Virtual
Network Client (VNC) (1). All subsequent steps are carried out using the xGDBvm GUI.
Authorize the VM to connect to remote HPC resources via the Agave API
(http://agaveapi.co) (2). Configure the path to Data Store inputs and set other parameters
including remote job execution (optional). xGDBvm will validate files, return expected
outputs and flag any input file errors (3). Initiate automated workflows and monitor
progress (4). The workflow sends some data remotely for processing on High
Performance Computing (HPC) resources (https://www.xsede.org/) managed by Agave
APIs, and processes other files locally using the attached volume as a scratch disk. The
xGDBvm workflow waits for HPC outputs, then proceeds with the annotation process.
Output data are written to the external volume and can be accessed from xGDBvm Web
browser as GDB001, GDB002, etc. (5). In addition to a fully featured genome browser,
xGDBvm includes tools to query, update, reannotate, download, or archive outputs to the
user’s Data Store. For details, refer to the xGDBvm wiki
(http://goblinx.soic.indiana.edu/wiki/doku.php).
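The "wait for HPC outputs, then proceed" step in panel C can be pictured as a simple polling loop. The following is a minimal sketch only, not xGDBvm's or Agave's actual code: `wait_for_job`, the `get_status` callable, and the status strings are hypothetical placeholders chosen for illustration.

```python
import time
from typing import Callable

# Illustrative status names (assumption); a real job service defines its own.
TERMINAL_STATES = {"FINISHED", "FAILED", "STOPPED"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a remote job until it reaches a terminal state.

    get_status is a hypothetical callable that returns the job's current
    status string; it stands in for whatever client call queries the job.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        # Job still pending or running: pause before asking again.
        sleep(poll_seconds)
    raise TimeoutError(f"job did not finish after {max_polls} polls")
```

A caller would continue with local processing only when the returned status indicates success, and would surface any other terminal state to the user. Passing `sleep` as a parameter keeps the loop testable without real delays.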
some parsing scripts; N.M. provided guidance for xGDBvm’s
Figure 2. Data process schema. Input data types (with standardized names as indicated),
computational modules, and outputs are shown. Images are screenshots of color-coded
track glyph types (gene models; splice alignments) and track flags (quality scores)
displayed in the xGDBvm genome browser.
Figure 3. xGDBvm architecture. An xGDBvm VM instance, as hosted on the CyVerse
Atmosphere cloud infrastructure (https://atmo.iplantcollaborative.org/application), has
separate file system partitions under root (containing the xGDBvm Web GUI, scripts,
binaries, and other software) and /home/ (which is configured with mount points for the
user’s Data Store home directory for data input and a block storage volume for data
output). The Agave API, hosted by the CyVerse Discovery Environment, is used for
authentication of the VM via OAuth2 and for management of High Performance
Computing applications and job submission. A key feature of xGDBvm is the ability to
attach and mount the output volume to a different VM and reconstitute the annotation
outputs and display. See text for details.
Figure 4. xGDBvm data management. A. Screenshot of the GDB Configuration page,
set up for processing Example data. Each genome annotation is assigned a unique
identifier (GDB001, GDB002, etc.) and a user-provided name. In addition to form fields
for input data path, annotation parameters, and metadata, this page provides extensive
color-coded information about all system settings (e.g. license keys, storage capacity,
login status, displayed in blue-green), input data validity (light green), and expected
output (orange). The form includes buttons that launch modal windows to initiate
computational workflow or edit configuration. B. Screenshot of Archive/Delete menu,
showing genome databases with ‘Current’ (blue; computation complete) or
‘Development’ (grey; not yet run) status. Genome annotations are identified as GDB001,
GDB002, etc. Each table row displays information about a GDB including time stamps as
well as action buttons that allow the user to Drop, Delete, Archive, Delete Archive, or
Copy database (see text for details). Global action buttons (top right) allow the user to
Delete or Archive all data on the VM. C. Screenshot of List All Jobs page with tools to
monitor and manage remote HPC jobs. The page displays IDs, job metadata, time stamps,
color-coded status indicators and action buttons to manage output (Stop Job, Delete Job,
View Logs, Copy Output) via the Agave API; see text for details.
Figure 5. Genome context view. Shown is a typical region from the Capsella rubella
genome annotation described under Results. The genome span is shown in yellow, and
genome features (tracks) are labeled to the left of and above each track. Drag-and-drop reordering and “hide track” features are implemented here. The top bar provides search and
navigation controls; left bar contains links to tools and views, as well as to configuration
and help pages. Region submenu (orange) contains zoom/scroll, region-specific tools and
formatting controls. See Table 1 for details of xGDBvm tools and features.
Figure 6. Gene model improvement using yrGATE. A. A published gene model from
Capsella rubella (Carubv1011418m.g) showing high coverage/low integrity in the Locus
Table (upper table, highlighted columns). B. Corresponding gene model in genome
context view (blue glyph). CpGAT annotated this region as two distinct loci (magenta
glyph), backed up by both Arabidopsis protein (black) and cDNA (light blue). The region
was then re-annotated using yrGATE (dark and light green glyphs) to confirm the most
probable gene structure of this region based on available evidence. yrGATE glyphs are
color-coded according to the type assigned by the annotator, e.g. dark green (improved
structure); light green (new structure not previously annotated).
TABLES
Table 1. xGDBvm features (1)

| Section | Feature | Functions |
| --- | --- | --- |
| Manage | Administrate | Modify password protection; customize site name; administer yrGATE user accounts |
| | Create/Configure | Configure new GDB; validate input files; view/edit configuration; initiate, monitor automated workflows; view log files; archive/restore/delete GDB, copy archive to Data Store |
| | Remote Jobs | Configure OAuth2 login, job APIs, App IDs; submit jobs; view job status; manage jobs (CyVerse login required) |
| View GDB | GDB Home Page | GDB summary data; view genome region or search for sequence |
| | Genome Context View | View all tracks by genome segment and region; zoom, jump up or downstream, view nucleotide-level alignments |
| View GDB: Feature Tracks | Gene Predictions (Loci) | All annotated loci and metadata in tabular view; search/filter queries; yrGATE summaries for each locus; download as .csv |
| | Aligned Proteins, Aligned Transcripts | All spliced alignments in tabular views; search/filter queries; download as .csv |
| | GAEVAL scores | Detailed gene quality scores for each Gene Prediction track; search/filter queries |
| View GDB: Tools (Genome Context View) | Download region | Download any sequence type from region as FASTA; download annotations from region in GFF3 or NCBI format |
| | Download data | Download individual input files, output files (all types), or GDB archive files to the local drive |
| | Search ID or keyword | Search and retrieve FASTA sequence or subsequence (introns, exons, up/downstream) for any feature displayed on GDB |
| | Blast GDB | Match sequence within GDB |
| | Blast All GDB | Match sequence across multiple GDB |
| | CpGAT annotate region | Regional gene predictions and quality scores |
| | Add Custom Track | Add custom track from local GFF3 file |
| | GenomeThreader region | Regional spliced alignment of proteins |
| Annotate | yrGATE | Tool for creating/submitting user-contributed annotations; with portals to NCBI ORF finder; NCBI BLAST; GENSCAN; GeneMark; CpGAT |
| | Community Central | Searchable list of curated yrGATE (user-submitted) annotations; download annotations (FASTA, GFF3) |
| | My Annotations | Manage User Annotations (admin account & login required) |
| | My Groups | View Group Annotations (admin account & login required) |
| | My Admin | Curate user-submitted Annotations (admin account & login required) |
| Help | Help pages | User instructions and video tutorials; also available as contextual help popups |
| | xGDBvm Wiki (external) | Documentation and instructions for users/admins/developers |
| | Github Repository (external) | Source code; Issue tracking; Case studies |

(1) As implemented on the iPlant (CyVerse) Atmosphere cloud service.
Table 2. Annotation of the Capsella rubella genome (1)

| Measure | Count |
| --- | --- |
| Genome segments | 853 |
| Total length (bp) | 134,834,574 |
| A. thaliana cDNA spliced alignments: Total | 49,947 |
| A. thaliana cDNA spliced alignments: Cognate (3) | 44,870 |
| A. thaliana protein spliced alignments | 34,629 |
| CpGAT gene predictions: Transcripts | 25,498 |
| CpGAT gene predictions: Loci | 22,698 |
| CpGAT gene predictions: Questionable (4) | 254 |
| Published gene predictions (2): Transcripts | 28,447 |
| Published gene predictions (2): Loci | 26,521 |
| Published gene predictions (2): Questionable (4) | 558 |

(1) See also http://goblinx.soic.indiana.edu/GDB002/ for data display and download.
(2) Source: ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Crubella/annotation/Crubella_183_gene.gff3.gz
(3) The single location with the best alignment score for a given query sequence.
(4) Less than 75% integrity score and greater than 75% coverage based on GAEVAL analysis (see Methods).
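The two definitions in the Table 2 footnotes ("Cognate" and "Questionable") can be expressed as short filters. This is a minimal sketch under stated assumptions, not the GAEVAL or xGDBvm implementation: the class names, field names, and the representation of scores as fractions between 0 and 1 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class GeneModel:
    model_id: str
    integrity: float  # assumed: GAEVAL integrity score as a fraction in [0, 1]
    coverage: float   # assumed: GAEVAL coverage as a fraction in [0, 1]


@dataclass(frozen=True)
class Alignment:
    query_id: str
    location: str  # free-form locus label, for illustration only
    score: float   # higher is better (assumption)


def is_questionable(model: GeneModel, threshold: float = 0.75) -> bool:
    """Footnote 4: integrity below 75% while coverage is above 75%."""
    return model.integrity < threshold and model.coverage > threshold


def cognate_alignments(alignments: Iterable[Alignment]) -> Dict[str, Alignment]:
    """Footnote 3: keep, per query, the single best-scoring location."""
    best: Dict[str, Alignment] = {}
    for aln in alignments:
        current = best.get(aln.query_id)
        if current is None or aln.score > current.score:
            best[aln.query_id] = aln
    return best
```

With these definitions, a model that is well covered by evidence but structurally inconsistent with it is flagged, whereas a model with little supporting evidence is not. Ties between equally scored locations resolve to the first one seen, which is an arbitrary choice made for this sketch.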
Parsed Citations
Abouelhoda, M.I., Kurtz, S., and Ohlebusch, E. (2002). The enhanced suffix array and its applications to genome analysis. In
Second Workshop on Algorithms in Bioinformatics (Springer-Verlag), pp. 449-463.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Borodovsky, M., Mills, R., Besemer, J., and Lomsadze, A. (2003). Prokaryotic gene prediction using GeneMark and GeneMark.hmm.
In Current Protocols in Bioinformatics, Chapter 4, Unit 4 5, A.D. Baxevanis, ed (
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Cantarel, B.L., Korf, I., Robb, S.M., Parra, G., Ross, E., Moore, B., Holt, C., Sanchez Alvarado, A., and Yandell, M. (2008). MAKER: an
easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18, 188-196.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Dooley, R., Vaughn, M., Stanzione, D., Terry, S., and Skidmore, E. (2012). Software-as-a-Service: The iPlant Foundation API. In 5th
IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) (IEEE).
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Foissac, S., Gouzy, J.P., Rombauts, S., Mathé, C., Amselem, J., Sterck, L., Van de Peer, Y., Rouzé, P., and Schiex, T. (2008).
Genome annotation in plants and fungi: EuGene as a model platform. Current Bioinformatics 3, 87-97.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Goecks, J., Nekrutenko, A., and Taylor, J. (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and
transparent computational research in the life sciences. Genome Biol 11, R86.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Goff, S., Vaughn, M., McKay, S., Lyons, E., Stapleton, A., Gessler, D., Matasci, N., Wang, L., Hanlon, M., Lenards, A., Muir, A.,
Merchant, N., and al., e. (2011). The iPlant Collaborative: Cyberinfrastructure for Plant Biology. Front Plant Sci 2.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Gremme, G., Brendel, V., Sparks, M.E., and Kurtz, S. (2005). Engineering a software tool for gene structure prediction in higher
organisms. Information and Software Technology 47, 965-978.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Grigoriev, I.V., Nordberg, H., Shabalov, I., Aerts, A., Cantor, M., Goodstein, D., Kuo, A., Minovitsky, S., Nikitin, R., Ohm, R.A., Otillar,
R., Poliakov, A., Ratnere, I., Riley, R., Smirnova, T., Rokhsar, D., and Dubchak, I. (2012). The genome portal of the Department of
Energy Joint Genome Institute. Nucleic Acids Res 40, D26-32.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Haas, B., Salzberg, S., Zhu, W., Pertea, M., Allen, J., Orvis, J., White, O., Buell, C.R., and Wortman, J. (2008). Automated eukaryotic
gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol 9, R7.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Haas, B., Delcher, A., Mount, S., Wortman, J., Smith, R., Hannick, L., Maiti, R., Ronning, C., Rusch, D., Town, C., Salzberg, S., and
White, O. (2003). Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res
31, 5654 - 5666.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Hammesfahr, B., Odronitz, F., Hellkamp, M., and Kollmar, M. (2011). diArk 2.0 provides detailed analyses of the ever increasing
eukaryotic genome sequencing data. BMC Res Notes 4, 338.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Hoff, K.J., Lange, S., Lomsadze, A., Borodovsky, M., and Stanke, M. (2016). BRAKER1: Unsupervised RNA-Seq-Based Genome
Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32, 767-769.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Holt, C., and Yandell, M. (2011). MAKER2: an annotation pipeline and genome-database management tool for second-generation
genome projects. BMC Bioinformatics 12, 491.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Leroy, P., Guilhot, N., Sakai, H., Bernard, A., Choulet, F., Theil, S., Reboux, S., Amano, N., Flutre, T., Pelegrin, C., Ohyanagi, H.,
Seidel, M., Giacomoni, F., Reichstadt, M., Alaux, M., Gicquello, E., Legeai, F., Cerutti, L., Numa, H., Tanaka, T., Mayer, K., Itoh, T.,
Quesneville, H., and Feuillet, C. (2012). TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant
Genomes. Front Plant Sci 3, 5.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Mungall, C.J., Misra, S., Berman, B.P., Carlson, J., Frise, E., Harris, N., Marshall, B., Shu, S., Kaminker, J.S., Prochnik, S.E., Smith,
C.D., Smith, E., Tupy, J.L., Wiel, C., Rubin, G.M., and Lewis, S.E. (2002). An integrated computational pipeline and database to
support whole-genome sequence annotation. Genome Biol 3, research0081-0081.0011.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Nocq, J., Celton, M., Gendron, P., Lemieux, S., and Wilhelm, B.T. (2013). Harnessing virtual machines to simplify next-generation
DNA sequencing analysis. Bioinformatics 29, 2075-2083.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Potter, S.C., Clarke, L., Curwen, V., Keenan, S., Mongin, E., Searle, S.M., Stabenau, A., Storey, R., and Clamp, M. (2004). The
Ensembl analysis pipeline. Genome Res 14, 934-941.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Reddy, T.B.K., Thomas, A.D., Stamatis, D., Bertsch, J., Isbandi, M., Jansson, J., Mallajosyula, J., Pagani, I., Lobos, E.A., and
Kyrpides, N.C. (2015). The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level
(meta)genome project classification. Nucleic Acids Res 43, D1099-1106.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Schlueter, S.D., Wilkerson, M.D., Dong, Q., and Brendel, V. (2006). xGDB: open-source computational infrastructure for the
integrated evaluation and analysis of genome features. Genome Biol 7, R111.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Schlueter, S.D., Wilkerson, M.D., Huala, E., Rhee, S.Y., and Brendel, V. (2005). Community-based gene structure annotation.
Trends Plant Sci 10, 9-14.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Slotte, T., Hazzouri, K.M., Agren, J.A., Koenig, D., Maumus, F., Guo, Y.L., Steige, K., Platts, A.E., Escobar, J.S., Newman, L.K., Wang,
W., Mandakova, T., Vello, E., Smith, L.M., Henz, S.R., Steffen, J., Takuno, S., Brandvain, Y., Coop, G., Andolfatto, P., Hu, T.T.,
Blanchette, M., Clark, R.M., Quesneville, H., Nordborg, M., Gaut, B.S., Lysak, M.A., Jenkins, J., Grimwood, J., Chapman, J.,
Prochnik, S., Shu, S., Rokhsar, D., Schmutz, J., Weigel, D., and Wright, S.I. (2013). The Capsella rubella genome and the genomic
consequences of rapid mating system evolution. Nat Genet 45, 831-835.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Specht, M., Stanke, M., Terashima, M., Naumann-Busch, B., Janssen, I., Hohner, R., Hom, E.F., Liang, C., and Hippler, M. (2011).
Concerted action of the new Genomic Peptide Finder and AUGUSTUS allows for automated proteogenomic annotation of the
Chlamydomonas reinhardtii genome. Proteomics 11, 1814-1823.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Standage, D., and Brendel, V. (2012). ParsEval: parallel comparison and analysis of gene structure annotations. BMC
Bioinformatics 13, 187.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Stanke, M., Keller, O., Gunduz, I., Hayes, A., Waack, S., and Morgenstern, B. (2006). AUGUSTUS: ab initio prediction of alternative
transcripts. Nucleic Acids Res 34, W435-439.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Thibaud-Nissen, F., Souvorov, A., Murphy, T., DiCuccio, M., and Kitts, P. (2013). GNOMON (Eukaryotic Genome Annotation
Pipeline). In The NCBI Handbook [Internet]. 2nd edition. (Bethesda (MD): National Center for Biotechnology Information (US)), pp.
http://www.ncbi.nlm.nih.gov/books/NBK169439/.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Uberbacher, E.C., Hyatt, D., and Shah, M. (2004). GrailEXP and Genome Analysis Pipeline for genome annotation. In Current
protocols in human genetics, J.L. Haines, ed, pp. Unit 6 5.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Usuka, J., Zhu, W., and Brendel, V. (2000). Optimal spliced alignment of homologous cDNA to a genomic DNA template.
Bioinformatics 16, 203-211.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Wang, B.B., O'Toole, M., Brendel, V., and Young, N.D. (2008). Cross-species EST alignments reveal novel and conserved
alternative splicing events in legumes. BMC Plant Biol 8, 17.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Wilkerson, M.D., Schlueter, S.D., and Brendel, V. (2006). yrGATE: a web-based gene-structure annotation tool for the identification
and dissemination of eukaryotic genes. Genome Biol 7, R58.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
Yandell, M., and Ence, D. (2012). A beginner's guide to eukaryotic genome annotation. Nature Reviews Genetics 13, 329-342.
Pubmed: Author and Title
CrossRef: Author and Title
Google Scholar: Author Only Title Only Author and Title
xGDBvm: A Web GUI-driven workflow for annotating eukaryotic genomes in the cloud
Jon Duvick, Daniel S Standage, Nirav Merchant and Volker P Brendel
Plant Cell; originally published online March 28, 2016;
DOI 10.1105/tpc.15.00933