The Triticeae Toolbox: Combining Phenotype and Genotype Data to

Published June 15, 2016
o r i g i n a l r es e a r c h
The Triticeae Toolbox: Combining Phenotype and
Genotype Data to Advance Small-Grains Breeding
Victoria C. Blake, Clay Birkett, David E. Matthews, David L. Hane,
Peter Bradbury, and Jean-Luc Jannink*
Abstract
The Triticeae Toolbox (http://triticeaetoolbox.org; T3) is the
database schema enabling plant breeders and researchers to
combine, visualize, and interrogate the wealth of phenotype
and genotype data generated by the Triticeae Coordinated
Agricultural Project (TCAP). T3 enables users to define specific
data sets for download in formats compatible with the external
tools TASSEL, Flapjack, and R; or to use by software residing on
the T3 server for operations such as Genome Wide Association
and Genomic Prediction. New T3 tools to assist plant breeders
include a Selection Index Generator, analytical tools to compare
phenotype trials using common or user-defined indices, and a
histogram generator for nursery reports, with applications using
the Android OS, and a Field Plot Layout Designer in development. Researchers using T3 will soon enjoy the ability to design
training sets, define core germplasm sets, and perform multivariate analysis. An increased collaboration with GrainGenes and
integration with the small grains reference sequence resources will
place T3 in a pivotal role for on-the-fly data analysis, with instant
access to the knowledge databases for wheat and barley. T3
software is available under the GNU General Public License and
is freely downloadable.
The Triticeae Toolbox
A Brief History
T
he last generation of plant breeders have employed
marker-assisted selection (Dubcovsky, 2004) wherein
each line is evaluated with only a few genetic markers.
New technologies have shifted the paradigm to an analysis of high-density (between a thousand and a million),
genome-wide markers at equal or lower cost. In addition
to high-throughput sequencing, automated phenotype
measurement methods are rapidly progressing (Montes et al., 2011; Furbank and Tester, 2011), though some
types of automation remain specialized and expensive.
The TCAP, funded by the United States Department of
Agriculture’s National Institute of Food and Agriculture
(USDA-NIFA), aims to use these data-rich technologies to discover genes in wheat (Triticum aestivum L.)
and barley (Hordeum vulgare L.) that will contribute to
improved water and nitrogen use efficiency in future
varieties. To this end, TCAP is capitalizing on current
genotyping and sequencing technologies and applying
new phenotyping strategies like canopy spectral reflectance (CSR) that indeed bring plant breeding into the
realm of “Big Data” (Howe et al., 2008; Marx, 2013). To
V.C. Blake and D.L. Hane, USDA-ARS Western Regional Research
Center, 800 Buchanan St. Albany, CA; C. Birkett, D.E. Matthews, P.
Bradbury, and J.-L. Jannink, USDA-ARS Robert Holley Center, Tower
Rd., Ithaca, NY. Received 30 Dec. 2014. Accepted 2 Nov. 2015.
*Corresponding author ([email protected]).
Published in Plant Genome
Volume 9. doi: 10.3835/plantgenome2014.12.0099
© Crop Science Society of America
5585 Guilford Rd., Madison, WI 53711 USA
This is an open access article distributed under the CC BY-NC-ND
license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
the pl ant genome

ju ly 2016

vol . 9, no . 2 Abbreviations: CSR, canopy spectral reflectance; GBS, genotypeby-sequencing; GRIN, Germplasm Resource Information Network;
GWAS, genome-wide association studies; NSGC, National Small
Grain Collections; SNP, single nucleotide polymorphism; T3, The
Triticeae Toolbox; TCAP, Triticeae Coordinated Agricultural Project;
THT, The Hordeum Toolbox.
1
of
10
Table 1. Public databases using The Triticeae Toolbox (T3) schema.
Database
T3 Wheat
T3 Barley
T3 Oat
Breeder’s Datafarm
(wheat)
Breeder’s Datafarm
(barley)
URL
Description
Support†
triticeaetoolbox.org/wheat
triticeaetoolbox.org/barley
triticeaetoolbox.org/oat
malt.pw.usda.gov/t3/bd/
genotype/phenotype database and analysis tools for wheat (Triticum spp.)
genotype/phenotype database and analysis tools for barley (Hordeum spp.)
genotype/phenotype database and analysis tools for oat (Avena spp.)
genotype/phenotype database for US Regional Uniform Nurseries measuring traits
related to Fusarium spp. infection in wheat
genotype/phenotype database for US Regional Uniform Nurseries measuring traits
related to Fusarium spp. infection in barley
TCAP
TCAP
NAMA, AAFC
USWBSI
malt.pw.usda.gov/t3/bdb/
USWBSI
† AAFC, Agriculture and Agri-Food Canada; NAMA, North American Millers’ Association; TCAP, USDA-ARS Triticeae Coordinated Agriculture Project; USWBSI, U.S. Wheat and Barley Scab Initiative.
handle this quantity of data within TCAP, increased data
handling and improved tools for users and curators were
clearly going to be needed.
While plant genomic databases like GrainGenes
(Matthews et al., 2003; Carollo et al., 2005) and MaizeGDB (Lawrence et al., 2004; Sen et al., 2009) are rich in
genetic information such as maps and sequences, they
are primarily intended as knowledgebases that facilitate
access to published results of genetic analyses but not to
the original data and analyses themselves. Thus, T3 was
adopted from The Hordeum Toolbox (THT; Blake et al.,
2012) to integrate and provide access to phenotypic and
genotypic data generated by TCAP.
The Triticeae Toolbox is actually several distinct
databases (Table 1) holding data in separate production
databases for barley, wheat, and oat, and results of U.S.
Uniform Regional Nurseries for the U.S. Wheat and Barley Scab Initiative. Sandbox databases are available to data
contributors to test the conformity of their data files before
submitting them to the managing curator, who loads the
data onto a production database. Sandboxes are mirrored
nightly from the production databases. T3 is open-source
and freely available at https://github.com/Dave-Matthews/
The-Triticeae-Toolbox (accessed 14 Aug. 2014; verified 2
Feb 2016). Installation instructions are given.
Germplasm with data in T3/wheat and T3/barley
represents the most promising populations from breeding programs of research institutions in the United States
participating in the TCAP project. The National Small
Grain Collections (NSGC) of both wheat and barley are
also being extensively used for the TCAP project, whose
mandate, in part, is to discover genes for improved water
use and N efficiency from within the vast collections held
by the USDA. The Germplasm Resource Information
Network’s (GRIN) decades of data taken from the 1960s
through the 1990s by USDA scientists and plant breeders
archives (Harold Bockelman, personal communication,
2010) was mined and loaded onto T3, contributing tens
of thousands of additional phenotype data points for the
NSGC lines used in the TCAP.
Years of phenotype and genotype data collected in
prior barley and wheat research, the TCAP, GRIN, and
other miscellaneous programs are now held in T3 (Table
2). The phenotypic data spans greenhouse trials and field
nurseries grown across North America in all small grain
2
of
10
Table 2. Database content in The Triticeae Toolbox (T3)
Wheat and T3 Barley in May 2015.
Database content†
Trials
Phenotype trials
Genotype trials
TCAP data programs
Lines
Line records
Breeding programs
Lines with genotyping data
Lines with phenotype data
Phenotype data
Traits
Total phenotype data
Genotype data
Non-SNP markers
SNP markers
GBS markers
Total genotype data
T3 Wheat
T3 Barley
307
50
19
633
111
27
13,192
49
9,351
8,655
38,781
24
15,967
9,413
144
407,781
111
423,282
2,301
105,246
3547,662
157,333,013
1,001
9,641
55,782
49,534,252
† GBS, genotype-by-sequencing; SNP, single nucleotide polymorphism; TCAP, Triticeae Coordinated
Agricultural Project.
growing regions. T3 barley also holds data from Bolivia,
Ethiopia, and Hungary. Traits in T3 are partitioned into categories that are customized to best reflect the types of data
collected for each species. T3 Barley and T3 Wheat each
have trait categories for agronomic, morphological, and
disease-related traits. T3 Wheat reports quality traits in one
category, and T3 Barley has the three additional categories:
winter growth habit, malt quality, and food/feed quality.
A major strength of T3 is the ease with which data
can be loaded. No programming or database skills are
required to upload data, as T3 was developed to accept
upload files that can easily be created by data contributors and loaded as Microsoft Excel or text files. These
data are integrated immediately into the database,
assuring that T3 has the most up-to-date information
at all times. A library of current upload template files is
available, and individual template files are embedded
with further instructions to complete the data files. The
project maintains Sandbox versions of the databases
for participants to format and troubleshoot their data
in preparation for submission to the managing curator.
the pl ant genome

ju ly 2016

vol . 9, no . 2
The curator experience is interactive, and the T3 team
has strived to deliver clear and explicit messages to aid
troubleshooting upload files when errors occur.
Hardware
The server holding the production versions of T3 has
1.5 TB of disk space, 512 GB of RAM, and 32 CPU cores
at 2.0 Ghz. The Sandbox versions, the development
databases, and the Breeders Datafarms run on different
machines (Table 1), assuring that most users can access
the tools and resources on T3 without competing for
processing time with the developers and curators. The T3
website and database use standard components: Linux,
Apache HTTP Server (Apache Software Foundation),
a MySQL (My Structured Query Language) relational
database, and PHP (PHP hypertext preprocessing) programming language. Additional packages are used for
analysis and visualization of data: R script language
(Ihaka and Gentleman, 1996, R Development Core Team,
2008), X3DOM (Behr et al., 2009), and ViroBLAST
(Deng et al., 2007).
Database Initialization, Data Acquisition,
and Maintenance
The core design of T3 allows new projects to adapt easily,
but several constraints require consensus by the project
members for the resulting data to be useful. Members
must agree on a controlled vocabulary, as well as scales
and units for all phenotypic traits. The managing curator
needs to upload a great deal of background information
(i.e., institutions, project members, data types, and trait
descriptions) before the flow of data can begin. Germplasm records must exist in T3 before data can be loaded.
T3 requires the primary name to be one string, therefore
some names will require underscores to connect words
(ex. Chinese Spring becomes Chinese_Spring) while
spaces can simply be removed between letters and numbers (ex. Sumai 3 becomes Sumai3).
Data is submitted to T3 by means of templates that
are either comma- or tab-delimited text or Excel (Microsoft Corp., Redmond, WA) files, depending on the data
type. The underlying principle in designing T3, preserved
from that of THT, is to enable researchers to submit
data using simple templates to organize data generated
by their programs. A simple Web-based interface on T3
interacts with managing and user-curators concerning
the format of their data (for example, using the most upto-date template), or the integrity of the data (for example, verifying that measurements are within the range set
for a particular trait). This enables one curator to manage
data produced from a large number of sources. Curator
tools built into T3 enable data managers to easily edit
many of the fields of existing records, add new traits, and
delete phenotype trials, enabling T3 to remain current.
Some tools in T3, such as 3D Cluster Lines by Genotype, and Genomic Association and Prediction tools,
may require extensive calculations. Estimated execution time is calculated based on the number of lines
bl ake et al .: triti ceae toolbox for sm all gr ains and markers after filtering. When T3 estimates that the
execution time will be over 2 min, the estimated time
appears, and a link is provided to check if the job is finished. If the user is logged in to T3, then T3 will notify
the user by email on completion. For shorter calculations, T3 shows a progress indicator until the calculation
completes, then the page finishes loading.
Genotype Data Storage and Retrieval
To quickly retrieve large amounts of data from the
database, the genotype information is stored into twodimensional tables. This allows T3 to retrieve all the
genotype information for specific lines or markers as
one block of data. Records are also linked to experiment,
program, and genotyping platforms. Because T3 serves
to help a large group of PIs collaborate, emphasis was
placed on developing a simple interface to find datasets
coming from different groups and then collate them for
ease of downstream analysis. The menus designed for
T3 reflect the workflow that we expect in such analyses,
grouping together functions to select data, to analyze the
resulting selections, or to download them for analysis on
the client side. Our most innovative selection tool is the
Wizard that allows the user to search and combine lines
or phenotypes that share breeding program origins, were
evaluated in the same locations, or for the same traits.
Alternatively, starting from a previously selected set of
lines, it is possible to identify trials where any or all lines
were evaluated. These functions allow data from disparate trials to be quickly identified and brought together,
ensuring that data management is not an impediment to
collaboration. Complementing these search-and-combine
functions, T3 enables users to define panels of germplasm
lines for repeated use. For more complex analysis than
allowed by internal T3 tools, data can be downloaded formatted specifically for tools like TASSEL (Bradbury et al.,
2007), rrBLUP (Endelman, 2011), FlapJack (Milne et al.,
2010), and synbreed (Wimmer et al., 2012).
Genome-Wide Association Studies
The most exciting new feature in T3 is the ability to perform genome-wide association studies (GWAS; Risch
and Merikangas, 1996) in real time on selected data (Fig.
1). To access GWAS tools in T3, users click the Analyze
menu bar, and the wizard will walk them through a
series of steps beginning with assembling a dataset that
has both phenotypes and markers for one trait. Upon
returning to the GWAS interface through the Analyze
tab, users will be prompted to select a genetic map and
T3 will calculate the number of markers mapped for the
line set in the queue per given map in T3, allowing for
the best marker coverage in downstream applications.
With all necessary elements selected, T3 will now
give users access to GWAS and genomic prediction
tools (Hamblin et al., 2011). Both tools use the R package “rrBLUP” (Endelman, 2011). To execute a GWAS,
the user specifies the number of principal component
axes to be used as fixed effects and whether to use an
3
of
10
Fig. 1. (continued on next page) Genomic Association and Prediction Suite in T3. (A–D) Prediction tools for plump grain using unique
nurseries in Bozeman, MT, in 2008 and 2012. (E–G) GWAS tools using the plump grain trait in the CAPWUE_2012_HighN_Bozeman
trial. (A) Interface showing selected training and validation phenotype trials, with information about overlapping germplasm lines
and interactive text boxes to change Minimum minor allele frequency (MAF) and missing marker and line tolerance in the data. (B)
Histogram comparing phenotype scores for “breeders plump grain” in the selected trials. (C) Population structure visualization plotting
lines using the first two principal component eigenvectors. (D) Accuracy of the predicted values in the Validation Population. (E)
Scree plot of the principal component eigenvalues. (F) Q-Q plot showing observed GWAS p-values against expected under the null
hypothesis. (G) Manhattan plot of GWAS and top five markers scored in the analysis.
4
of
10
the pl ant genome

ju ly 2016

vol . 9, no . 2
Fig. 1. Continued.
bl ake et al .: triti ceae toolbox for sm all gr ains 5
of
10
EMMA or EMMAX type of random effect control for
population structure (Kang et al., 2010). The page then
displays a histogram of the phenotypic observations, a
scree-plot of the principal component eigenvalues, a Q-Q
plot of observed versus expected–log10 p-values, and a
Manhattan plot (any unmapped markers in the dataset
are binned in a chromosome 0). If multiple trials are
selected, a fixed effect for trial is added to the model. If
multiple traits are selected, only the first trait is analyzed.
In the example in Fig. 1, the trait “breeders plump
grain” was used to determine if data generated in 2008
in a diverse North American collection of breeder’s lines
could successfully predict the outcome of a similar set of
lines grown in the same location in 2012. To begin with,
the trait category “malt quality” was selected, then “breeders plump grain,” and the trial CAP3_2008_BozemanIrr,
a 778 line irrigated nursery, was selected. The genetic map
UCR04162008 (Szucs et al., 2009) was selected to provide
marker data, which had the best coverage for the selected
lines with 3072 of 4608 markers mapped. Data was filtered
using the default parameters (minimum minor allele
frequency £ 5%; Remove markers missing > 10% of data;
Remove lines missing > 10% of data) resulting in a training set with 2171 markers on 773 lines.
We show here a synthesis of results from both GWAS
and genomic prediction tools. GWAS was implemented
using three principal components as fixed effects to control for population structure. The mixed model used the
EMMAX method. A separate validation set, a TCAP nursery CAPWUE_2012_HighN_Bozeman (256 lines with 74
removed because they were also in the training set), was
chosen. Predictions based on model training from the 2008
data had a prediction accuracy of 0.41 on that validation set.
Rather than GWAS, the same inputs can be sent to
a cross-validation analysis of genomic prediction using
ridge regression, which is equivalent to a GBLUP analysis
(Lorenz et al., 2011). In that case, five-fold cross-validation
is performed on the data. The output is the histogram of
phenotypes, a scatter-plot of the genotypes using the first
two principal component eigenvectors, and a scatter-plot
of the observed phenotype against its cross-validated
prediction. Prediction accuracy is evaluated separately
in each trial and observed versus prediction points in
the scatterplot are given in a different color for each trial.
Alternatively, the user may choose to select a second
target dataset for prediction or cross-validation. In that
case, the first dataset serves solely to train the prediction
model. If the second dataset has the trait being modeled
in the training set, T3 produces a scatter plot of observed
against predicted phenotypes and calculates the correlation between the two in this validation set. Otherwise, T3
shows the same cross-validation plot described above. In
all cases, a link is provided to a comma-separated values
file with the lines, their cross-validated or direct prediction, and an indicator of whether the lines were in the
training set or in the target set. When a second dataset
is selected, the principal component scatterplot is also
modified to show both training and target datasets.
6
of
10
While significant markers may commonly arise in a
GWAS analysis when testing just one trial, when analyzing the genetic effect on a trait over many locations under
varying conditions, more data is required. To test the
hypothesis that more phenotype data in T3 for a given
trait will increase the likelihood that a significant marker
is discovered among several environments, GWAS was
performed on data for the trait “breeders plump grain”
that was available for all trials in THT in 2007 and 2008
(Waugh et al., 2009) and the data currently held in T3,
which is nearly triple the germplasm tested. As shown
in Fig. 2, data in combined locations in 2007 and 2008
was not sufficient to generate a significant marker, but
by 2015 the increase in phenotype data in T3 enables the
GWAS tools to identify two significant markers so T3
draws a horizontal LOD score line on the Manhattan plot
to indicate significance.
Genotype-By-Sequencing Big Data in T3
Genotype-by-sequencing (GBS) is a low cost, highthroughput method to employ next-generation sequencing in genomes whose complexity has been reduced, usually by restriction enzymes (Elshire et al., 2011; Poland
et al., 2012). The T3 project has strived to stay current
with GBS data for wheat and barley, and a list of trials
can be accessed via the “Genotype experiments” link on
the left “Quick links” menu, or from the “Content status” page under the “About T3” menu. While there are
synonymous markers discovered among the early wheat
GBS trials (Table 3), until there are reference sequences
available, the markers will remain distinct within the
trials, and no synonyms will be created, thus demoting a
marker to an alias rather than a primary marker name.
GBS markers in that are published without marker
names are processed in the following manner:
1. The markers are filtered for duplicates and those
whose reported single nucleotide polymorphism
(SNP) position is beyond the length of the marker.
2. The A and B SNP alleles are alphabetized and placed
in brackets within the marker sequence, keeping in
mind that the published SNP position may begin
with 0 rather than 1.
3. Markers are filtered again for duplicates that arise
with the alphabetization.
4. Markers are then ordered alphabetically and named
with a prefix to identify the experiment (e.g.,
gbsHWW, gbsCNL) and numbered sequentially.
Selection Index Generator
A selection index, first developed for animal breeders (Hazel, 1943), ranks breeding lines for multiple
traits simultaneously, then computes a single number
representing a combined score for all the traits. Plant
breeders will find the selection index generator in T3
intuitive and flexible for helping to select germplasm to
advance in programs and evaluate nurseries for traits of
the pl ant genome

ju ly 2016

vol . 9, no . 2
Fig. 2. Manhattan plots from genome-wide association studies to find significant markers for “breeders plump grain” with the increasing
amount of data available in 2007, 2008, and 2015.
Table 3. Number of synonymous markers among the
early wheat genotype by sequencing (GBS) trials in
The Triticeae Toolbox. Total markers in the trial are indicated in parentheses after the trial name.
GBS Trial Name
SynOP_GBS_2012BinMap
(19,777)
HW WAMP_2013_GBS
(58,172)
CornellMaster_2013_GBS
(48,069)
HWWAMP_
2013_GBS
(58,172)
CornellMaster_
2013_GBS
(48,069)
HWWAMP_
2014_GBS
(1243,145)
5,026
4,076
882
–
26,218
2,913
–
–
2,199
interest. Though conceptually intuitive, complications
arise because the numerical magnitude and the variance
of some traits are greater than others, and because the
desirable values of a trait may not always align to moreis-better rule (e.g., malt protein, heading date).
In the design of the T3 Selection Index generator,
the user can choose to rescale the trait values either by
standardizing to a distribution with a mean of zero and
standard deviation of 1.0, or as percentage of the value for
a reference line. A trait’s scale can be reversed and traits
can be weighted unequally. Figure 3 provides an example
where barley germplasm that was planted for 2 yr in Aberdeen, ID (irrigated with normal N), are indexed for the
agronomic trait “heading date (Julian)” and the “Malting
bl ake et al .: triti ceae toolbox for sm all gr ains quality” traits “grain protein” and “plump grain” are
selected (Fig. 3A). Since earliness to flower and low protein
are desirable, the reverse ranging is selected (Fig. 3B). In
this example, most of the top performing lines were from
the North Dakota program, with one Montana line and
the ND released variety Pinnacle (PI 643354; Fig. 3C).
3D Cluster by Genotype
T3 now provides three functions to cluster lines by
genotype, a 2D method adopted from THT (Blake et al.,
2012), and two new methods to generate a 3D representation of diversity by clustering. All functions require
a selection of lines that have marker data. Marker data
in columns are centered and missing marker scores
imputed to zero (this is equivalent to mean imputation, Rutkoski et al., 2013). The first function uses the
“partitioning around medoids” clustering algorithm
(Maechler et al., 2013) applied to a Manhattan distance matrix calculated from the marker scores in R (R
Development Core Team, 2008). The user specifies the
number of clusters, k, the algorithm chooses k breeding lines to be medoids, and then identifies which other
lines to associate with each medoid such that the sum of
distances between each line and its associated medoid
is minimized. The second function is experimental and
uses random projections of the marker data to create a
similarity matrix. That matrix is then entered into the
R “hclust” function for clustering. The 3D functions
represent the clustering output by plotting a colored
7
of
10
Fig. 3. Selection index generator in T3 allows users to select traits and trials and rank lines with several options including user-specified
ranking indices.
point for each germplasm line using three-dimensional
coordinates obtained from the first three eigenvectors
of the singular value decomposition of the marker score
matrix. The graphical display allows the user to rotate the
three-dimensional array of colored points freely with the
mouse. The graphical display software is the JavaScript
package X3DOM (Behr et al., 2009).
the database. These data can then be used as trial data for
T3’s GWAS tools. The TCAP project has contributed data
using the Jaz, USB2000 (Ocean Optics, Dunedin, FL),
and CropScan (Cropscan Inc. Rochester MN) instruments. The metadata for the recording instrument and
field layout is stored in the database.
Canopy Spectral Reflectance Plot-Level
Data Management
The Android Field Book is “an open-source application
for electronic data capture that runs on consumer-grade
Android tablets” (Rife and Poland, 2014). If the user is
working with a field trial for which T3 has a spatially
explicit field layout (i.e., the row and column coordinates of plots are given), T3 can create and download a
comma separated values (.csv) file compatible with the
Field Book. Field data can be collected directly onto an
Android tablet with the Field Book application, then
uploaded directly to T3 without further formatting. T3
can obtain spatially explicit field layouts in two ways.
Users can upload field layouts. Alternatively, T3 has
an experimental design interface page that can design
experiments and produce field plans. Designing experiments on T3 directly decreases risks of error from, for
example, mismatches between files uploaded for the field
layout and for the phenotypic measurements and eliminates the need to upload the design separately.
Canopy Spectral Reflectance is a remote sensing technique to evaluate water and nitrogen use in crop plants
(Tucker, 1979; Xue et al., 2004). The CSR data is an example of raw plot level phenotype measurements and are
stored as discreet files. This is in contrast to the means or
adjusted scores submitted with all other types of phenotype data that are integrated into the database on upload.
A typical use of the CSR data is to calculate an index
based on the measurements at specific frequencies. T3
provides tools for calculating 10 commonly used indices.
The wavelength parameters and formula can be modified
to create new indices. Built into the analysis tool is a plot
of each CSR measurement that can be used to identify
questionable data or for adjusting the wavelength parameters of the analysis. After calculating the CSR index,
the results can be saved as a plot level trait. If the user is
logged in, the line means can be calculated and saved in
8
of
10
Integration with the Android Field Book
the pl ant genome

ju ly 2016

vol . 9, no . 2
Discussion and Future Direction
for the Toolbox
The completion the TCAP in 2016 (projected) will not
spell the end of T3. It will persist. Other consortia have
adapted the T3 database schema, including a North
American Oat Research Consortium and the U.S. Wheat
and Barley Scab Initiative (Table 1).
The existing T3s for wheat and barley are in negotiation to merge with GrainGenes (http://wheat.pw.usda.
gov; accessed 4 Dec. 2012; verified 4 Feb. 2016). GrainGenes is the knowledgebase for the Triticeae and Avena,
and is currently overhauling the database and website
with new funding to bring that resource current (personal communication, Ann Blechl, 2014). GrainGenes
primarily contains genetic and mapping information
from peer-reviewed studies, whereas T3 is a “diversity
base” containing the raw information used in such studies, and is much better designed for storing and delivering phenotype data.
Analytical Tools in Development
Training Population Design
This upgrade will allow the user to specify a set of Nc lines
that are candidates to be in a training population of size Nt
(Nc > Nt) and a set of lines that are the target of prediction.
The analysis will then select the Nt such that the prediction
accuracy on the target set is expected to be maximized.
Definition of Core Germplasm Sets
Similar to the above, the user will define a set of Nc lines
that are candidates to be in a core set of size Nt (Nc > Nt)
thus determining if the variation is present in the core
set, and if this set possesses the ability to predict the performance of other lines.
Multivariate Analysis
Initially, this will be limited to multivariate prediction
methods, rather than GWAS. The user specifies a set of
Nt training lines, a set of traits, and a set of lines that are
the target of prediction. The analysis predicts values for
all traits for the targets.
Improved Handling of GBS Markers and Big Data
T3 currently uploads and can retrieve and process GBS
data. Nevertheless, with the size of the datasets, manipulations are slow and need to be improved.
Experimental Designs with Field Maps
T3 currently allows researchers to design trials with a
modified augmented design that is spatially explicit.
Such designs can be uploaded to the Android Field Book
to facilitate data collection and can also be analyzed
with spatial statistical models. We will expand the list of
experimental designs to randomized complete block and
augmented designs.
bl ake et al .: triti ceae toolbox for sm all gr ains As reference genomes become available, T3 will
incorporate these to anchor the current marker data
for genotype by sequencing and high throughput SNP
arrays, thus providing more confidence for GWAS and
genomic prediction studies. A genomic browser for T3 is
in development and should be available in late 2015.
Acknowledgments
This research was supported by the Agriculture and Food Research Initiative Competitive Grant No. 2011-68002-30029 from the USDA National
Institute of Food and Agriculture. Maintenance and further development
of T3 by GrainGenes is supported by USDA-ARS project 5325-21000014-00, “An Integrated Database and Bioinformatics Resource for Small
Grains.” Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does
not imply recommendation or endorsement by the U.S. Department of
Agriculture. Upon request, all novel materials described in this publication will be made available in a timely manner for noncommercial
research purposes, subject to the requisite permission from any thirdparty owners of all or parts of the material. Obtaining any permissions
will be the responsibility of the requestor.
References
Behr, J., P. Eschler, Y. Jung, and M. Zöllner. 2009. X3DOM: A DOMbased HTML5/X3D integration model. Proc. of the 14th Int. Conf.
on 3D Web Technology, 16–17 June 2009, Darmstadt, Germany. p.
127–135. doi:10.1145/1559764.1559784
Blake, V.C., J.G. Kling, P.M. Hayes, J.-L. Jannink, S.R. Jillella, J. Lee,
D.E. Matthews, S. Chao, T.J. Close, G.J. Muehlbauer, K.P. Smith,
R.P. Wise, and J.A. Dickerson. 2012. The Hordeum Toolbox: The
Barley Coordinated Agricultural Project genotype and phenotype
resource. Plant Gen. 5:81–91. doi:10.3835/plantgenome2012.03.0002
Bradbury, P.J., Z. Zhang, D.E. Kroon, T.M. Casstevens, Y. Ramdoss,
and E.S. Buckler. 2007. TASSEL: Software for association mapping
of complex traits in diverse samples. Bioinformatics 23:2633–2635.
doi:10.1093/bioinformatics/btm308
Carollo, V., D.E. Matthews, G.R. Lazo, T.K. Blake, D.D. Hummel, N.
Lui, D.L. Hane, and O.D. Anderson. 2005. GrainGenes 2.0. An
improved resource for the small-grains community. Plant Physiol.
139:643–651. doi:10.1104/pp.105.064485
Deng, W., D.C. Nickle, G.H. Learn, B. Maust, and J.I. Mullins. 2007.
ViroBLAST: A stand-alone BLAST web server for flexible queries
of multiple databases and user’s datasets. Bioinformatics 23:2334–
2336. doi:10.1093/bioinformatics/btm331
Dubcovsky, J. 2004. Marker-assisted selection in public breeding programs: The wheat experience. Crop Sci. 44:1895–1898. doi:10.2135/
cropsci2004.1895
Elshire, R.J., J.C. Glaubitz, Q. Sun, J.A. Poland, K. Kawamoto, E.S.
Buckler, and S.E. Mitchell. 2011. A robust, simple genotyping-bysequencing (GBS) approach for high diversity species. PLoS ONE
6:e19379. doi:10.1371/journal.pone.0019379
Endelman, J.B. 2011. Ridge regression and other kernels for genomic
selection with R package rrBLUP. Plant Gen. 4:250–255.
doi:10.3835/plantgenome2011.08.0024
Furbank, R.T., and M. Tester. 2011. Phenomics—Technologies to
relieve the phenotyping bottleneck. Trends Plant Sci. 16:635–644.
doi:10.1016/j.tplants.2011.09.005
Hamblin, M.T., E.S. Buckler, and J.-L. Jannink. 2011. Population genetics of genomics-based crop improvement methods. Trends Genet.
27:98–106. doi:10.1016/j.tig.2010.12.003
Hazel, L.N. 1943. The genetic basis for constructing selection indexes.
Genetics 28:476–490.
Howe, D., M. Costanzo, P. Fey, T. Gojobori, L. Hannick, W. Hide, D.P.
Hill, R. Kania, M. Shaeffer, S. St Pierre, S. Twigger, O. White, and
S.Y. Rhee. 2008. Big data: The future of biocuration. Nature 455:47–
50. doi:10.1038/455047a
Ihaka, R., and R. Gentleman. 1996. R: A language for data analysis and
graphics. J. Comput. Graph. Stat. 5:299–314.
9
of
10
Kang, H.M., J.H. Sui, S.K. Service, N.A. Zaitlen, S.-Y. Kong, N.B. Freimer, C. Sabatti and E. Eskin. 2010. Variance component model to
account for sample structure in genome-wide association studies.
Nat. Genet. 42:348–354. doi:10.1038/ng.548
Lawrence, C.J., Q. Dong, M.L. Polacco, T.E. Siegfried, and V. Brendel.
2004. MaizeGDB, the community database for maize genetics and
genomics. Nucleic Acids Res. 32:D393–D397. doi:10.1093/nar/gkh011
Lorenz, A.J., S. Chao, F.G. Asoro, E.L. Heffner, T. Hayashi, H. Iwata,
K.P. Smith, M.E. Sorrells, and J.-L. Jannink. 2011. Genomic selection in plant breeding: Knowledge and prospects. Adv. Agron.
110:78–123.
Maechler, M., P. Rousseeuw, A. Struyf, M. Hubert, and K. Hornik. 2013.
Cluster: Cluster analysis basics and extensions. R package v.1.14.4.
Available at: ftp://ftp.osuosl.org/pub/cran/web/packages/cluster/
cluster.pdf (accessed 02 Nov. 2014; verified 4 Feb. 2016).
Marx, V. 2013. Biology: The big challenges of big data. Nature 498:255–
260. doi:10.1038/498255a
Matthews, D.E., V.L. Carollo, G.R. Lazo, and O.D. Anderson. 2003.
GrainGenes, the genome database for small-grain crops. Nucleic
Acids Res. 31:183–186. doi:10.1093/nar/gkg058
Milne, I., P. Shaw, G. Stephen, M. Bayer, L. Cardle, W.T.B. Thomas,
A.J. Flavell, and D. Marshall. 2010. Flapjack—Graphical genotype
visualization. Bioinformatics 26:3133–3134. doi:10.1093/bioinformatics/btq580
Montes, J.M., F. Technow, B.S. Dhillon, F. Mauch, and A.E. Melchinger.
2011. High-throughput non-destructive biomass determination
during early plant development in maize under field conditions.
Field Crops Res. 121:268–273. doi:10.1016/j.fcr.2010.12.017
Poland, J.A., P.J. Brown, M.E. Sorrells, and J.-L. Jannink. 2012. Development of high-density genetic maps for barley and wheat using a
novel two-enzyme genotyping-by-sequencing approach. PLoS One
7(2):E32253. doi:10.1371/journal.pone.0032253
R Development Core Team. 2008. The R foundation for statistical computing, Vienna University of Technology, Vienna, Austria. Available at http://www.R-project.org/ (accessed 02 Nov. 2014; verified 4
Feb. 2016).
10
of
10
Rife, T.W., and J.A. Poland. 2014. Field Book: An open-source application for field data collection on Android. Crop Sci. 54:1624–1627.
doi:10.2135/cropsci2013.08.0579
Rutkoski, J.E., J. Poland, J.-L. Jannink, and M.E. Sorrells. 2013. Imputation of unordered markers and the impact on genomic selection
accuracy. G3: Genes Genomes Genet. 3:427–439. doi:10.1534/
g3.112.005363
Risch, N., and K. Merikangas. 1996. The future of genetic studies of
complex human diseases. Science 273:1516–1517. doi:10.1126/science.273.5281.1516
Sen, T.Z., C.M. Andorf, M.L. Shaeffer, L.C. Harper, M.E. Sparks, J.
Duvick, J.P. Brendel, E. Cannon, D.A. Campbell, and C.J. Lawrence.
2009. MaizeGDB becomes ‘sequence-centric’. Database doi:10.1093/
database/bap020
Szucs, P., V.C. Blake, P.R. Bhat, S. Chao, T.J. Close, A. Cuesta-Marcos,
G.J. Muehlbauer, L. Ramsay, R. Waugh, and P.M. Hayes. 2009.
An integrated resource for barley linkage map and malting quality QTL alignment. Plant Gen. 2:134–140. doi:10.3835/plantgenome2008.01.0005
Tucker, C.J. 1979. Red and photographic infrared linear combinations for monitoring vegetation. Remote Sens. Environ. 8:127–150.
doi:10.1016/0034-4257(79)90013-0
Waugh, R., J.-L. Jannink, G.J. Muehlbauer, and L. Ramsey. 2009. The
emergence of whole genome association scans in barley. Curr. Opin.
Plant Biol. 12:218–222. doi:10.1016/j.pbi.2008.12.007
Wimmer, V., T. Albrecht, H.-J. Auinger, and C.-C. Schön. 2012. synbreed:
A framework for the analysis of genomic prediction data using R.
Bioinformatics 28:2086–2087. doi:10.1093/bioinformatics/bts335
Xue, L., W. Cao, W. Luo, T. Dai, and Y. Zhu. 2004. Monitoring leaf
nitrogen status in rice with canopy spectral reflectance. Agron. J.
96:135–142. doi:10.2134/agronj2004.0135
the pl ant genome

ju ly 2016

vol . 9, no . 2